Re: bluestore cache

On Thu, May 26, 2016 at 12:27 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> Previously we were relying on the block device cache when appropriate
> (when a rados hint indicated we should, usually), but that is unreliable
> for annoying O_DIRECT reasons.  We also need an explicit cache to cope
> with some of the read/modify/write cases with compression and checksum
> block/chunk sizes.
>
> A basic cache is in place that is attached to each Onode, and maps the
> logical object bytes to buffers.  We also have a per-collection onode
> cache.  Currently the only trimming of data happens when onodes are
> trimmed, and we control that using a coarse per-collection num_onodes
> knob.
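
[A rough sketch of the trimming scheme described above, in Python for brevity; BlueStore itself is C++ and the real structures differ, so class names and the `num_onodes` knob here are purely illustrative. The key property is that buffer data hangs off the onode, so data is only dropped when its owning onode is trimmed:]

```python
from collections import OrderedDict

class Onode:
    """In-memory object metadata; cached data buffers hang off the onode."""
    def __init__(self, oid):
        self.oid = oid
        self.buffers = {}  # logical offset -> bytes

class CollectionCache:
    """Per-collection onode LRU, trimmed by a coarse num_onodes knob.
    Trimming an onode implicitly drops its cached buffers too."""
    def __init__(self, num_onodes=4):
        self.num_onodes = num_onodes
        self.onodes = OrderedDict()  # oid -> Onode, in LRU order

    def get_onode(self, oid):
        o = self.onodes.get(oid)
        if o is None:
            o = Onode(oid)
            self.onodes[oid] = o
        self.onodes.move_to_end(oid)  # mark most-recently-used
        self._trim()
        return o

    def _trim(self):
        # Evict least-recently-used onodes past the knob; their buffers go too.
        while len(self.onodes) > self.num_onodes:
            self.onodes.popitem(last=False)
```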
>
> There are two basic questions:
>
> 1. First, should we stick with a logical extent -> buffer mapping, or move
> to a blob -> blob extent mapping.  The former is simpler and attaches to
> the onode cache, which we also have to fix trimming for anyway.  On the
> other hand, when we clone an object (usually a head object to a snapshot),
> the cache doesn't follow.  Moving to a blob-based scheme would work better
> in that regard, but probably means that we have another object
> (bluestore_blob_t) whose lifecycle we need to manage.
>
> I'm inclined to stick with the current scheme since reads from just-cloned
> snaps probably aren't too frequent, at least until we have a better
> idea how to do the lifecycle with the simpler (current) model.
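
[To make the trade-off in (1) concrete, here is a minimal Python sketch of the current logical-extent -> buffer scheme, including the clone drawback mentioned above; names are illustrative and assume aligned, non-overlapping buffers for simplicity:]

```python
class BufferSpace:
    """Option 1 in the thread: map logical object offsets to cached
    buffers, attached to the onode rather than to blobs."""
    def __init__(self):
        self.buffers = {}  # logical offset -> bytes (aligned, non-overlapping)

    def did_write(self, offset, data):
        self.buffers[offset] = data  # cache what we just wrote

    def try_read(self, offset):
        return self.buffers.get(offset)  # None on cache miss

def clone(src):
    """A clone shares the on-disk blobs, but with the logical-extent
    scheme the cache does not follow: the clone starts cold."""
    return BufferSpace()
```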
>
> 2. Second, we need to trim the buffers.  The per-collection onode cache is
> nice because the LRU is local to the collection and already protected by
> existing locks, which avoids complicated locking in the trim path that
> we'd get from a global LRU.  On the other hand, it's clearly suboptimal:
> some pools will get more IO than others and we want to apportion our cache
> resources more fairly.
>
> My hope is that we can do both using a scheme that has collection-local
> LRU (or ARC or some other cache policy) for onodes and buffers, and then
> have a global view of what proportion of the cache a collection is
> entitled to and drive our trimming against that.  This won't be super
> precise, but even if we fudge the "perfect" per-collection cache size by,
> say, 30%, I wouldn't expect to see a huge cache hit rate change over that
> range (unless your cache is already pretty undersized).
>
> Anyway, a simple way to estimate a collection's portion of the total might
> be to have a periodic global epoch counter that increments every, say, 10
> seconds.  Then we could count ops globally and per collection.  When the epoch
> rolls over, we look at our previous epoch count vs the global count and
> use that ratio to size our per-collection cache.  Since this doesn't have
> to be super-precise we can get clever with atomic and per-cpu variables if
> we need to on the global count.
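
[The epoch-ratio idea sketched above might look roughly like this; a single-threaded Python illustration, ignoring the atomics/per-cpu counters and with invented names:]

```python
class CacheCalibrator:
    """Each epoch (say 10s), size every collection's cache by its share
    of the ops seen during the previous epoch."""
    def __init__(self, total_cache_onodes):
        self.total = total_cache_onodes
        self.global_ops = 0
        self.per_coll_ops = {}

    def note_op(self, coll):
        # In a real implementation these would be atomic/per-cpu counters.
        self.global_ops += 1
        self.per_coll_ops[coll] = self.per_coll_ops.get(coll, 0) + 1

    def roll_epoch(self):
        """Called when the epoch counter rolls over: compute each
        collection's cache budget from its op ratio, then reset."""
        sizes = {}
        for coll, ops in self.per_coll_ops.items():
            share = ops / max(self.global_ops, 1)
            sizes[coll] = max(1, int(self.total * share))
        self.global_ops = 0
        self.per_coll_ops = {}
        return sizes
```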


I'm a little concerned about this due to the nature of some of our IO
patterns. Maybe it's not a problem within the levels you're talking
about, but consider:
1) RGW cluster, in which the index pool gets a hugely disproportionate
number of ops in comparison to its actual size (at least for writes)
2) RBD cluster, in which you can expect a golden master pool to get a
lot of reads but have much less total data compared to the user block
device pools.

A system that naively allocates cache space based on proportion of ops
is going to perform pretty badly.
-Greg

>
> Within a collection, we're under the same collection lock, so a simple LRU
> of onodes and buffers ought to suffice.  I think we want something better
> than just a straight-LRU, though: some data is hinted WILLNEED, and
> buffers we hit in cache twice should get bumped up higher than stuff we
> just read off of disk.  The MDS uses a simple 2-level LRU list; I suspect
> something like MQ might be a better choice for us, but this is probably a
> secondary issue; we can optimize this independently once we have the
> overall approach sorted out.
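
[For reference, the MDS-style two-level LRU mentioned above could be sketched like this (Python, illustrative only): new entries land in a bottom list, a repeat hit or a WILLNEED hint promotes to the top list, and trimming evicts from the bottom first, so once-read data goes before hot data:]

```python
from collections import OrderedDict

class TwoLevelLRU:
    """Simple two-level LRU in the spirit of the MDS cache."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.top = OrderedDict()     # promoted (hot / WILLNEED) entries
        self.bottom = OrderedDict()  # entries touched only once

    def touch(self, key, willneed=False):
        if key in self.top:
            self.top.move_to_end(key)
        elif key in self.bottom or willneed:
            self.bottom.pop(key, None)
            self.top[key] = True      # promote on second hit or hint
        else:
            self.bottom[key] = True   # first touch: bottom level
        self._trim()

    def _trim(self):
        # Evict from the bottom list first; fall back to top only if empty.
        while len(self.top) + len(self.bottom) > self.capacity:
            victim = self.bottom if self.bottom else self.top
            victim.popitem(last=False)
```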
>
> Anyway, I guess what I'm looking for is feedback on (1) above, and whether
> per-collection caches with periodic size calibration (based on workload)
> sounds reasonable.
>
> Thanks!
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


