RE: bluestore cache

On Thu, May 26, 2016 at 12:27 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> Previously we were relying on the block device cache when appropriate
> (when a rados hint indicated we should, usually), but that is unreliable
> for annoying O_DIRECT reasons.  We also need an explicit cache to cope
> with some of the read/modify/write cases with compression and checksum
> block/chunk sizes.
>
> A basic cache is in place that is attached to each Onode, and maps the
> logical object bytes to buffers.  We also have a per-collection onode
> cache.  Currently the only trimming of data happens when onodes are
> trimmed, and we control that using a coarse per-collection num_onodes
> knob.
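A rough sketch of that shape, just to make the discussion concrete (the type names and fields below are illustrative stand-ins, not the actual BlueStore structures):

    #include <cstdint>
    #include <list>
    #include <map>
    #include <memory>
    #include <string>
    #include <unordered_map>
    #include <utility>

    struct CachedBuffer {
      std::string data;                          // stand-in for bufferlist
    };

    struct Onode {
      std::string oid;                           // stand-in for ghobject_t
      std::map<uint64_t, CachedBuffer> buffers;  // logical offset -> buffer
    };
    using OnodeRef = std::shared_ptr<Onode>;

    struct Collection {
      std::unordered_map<std::string, OnodeRef> onode_map;
      std::list<OnodeRef> onode_lru;             // front = most recently used
      size_t max_onodes = 1024;                  // coarse num_onodes knob

      // Buffers are only dropped when their onode falls off the LRU.
      void trim() {
        while (onode_lru.size() > max_onodes) {
          OnodeRef o = onode_lru.back();
          onode_lru.pop_back();
          onode_map.erase(o->oid);               // its buffers go with it
        }
      }
    };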
>
> There are two basic questions:
>
> 1. First, should we stick with a logical extent -> buffer mapping, or move
> to a blob -> blob extent mapping.  The former is simpler and attaches to
> the onode cache, which we also have to fix trimming for anyway.  On the
> other hand, when we clone an object (usually a head object to a snapshot),
> the cache doesn't follow.  Moving to a blob-based scheme would work better
> in that regard, but probably means that we have another object
> (bluestore_blob_t) whose lifecycle we need to manage.
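For what it's worth, the two keyings could be caricatured like this (purely illustrative, reusing the CachedBuffer stand-in from the sketch above):

    // Option 1: cache keyed by logical object extent, owned by the onode.
    // Simple, but a clone starts with an empty cache.
    using ExtentCache = std::map<uint64_t, CachedBuffer>;

    // Option 2: cache keyed by (blob, offset within blob).  A clone that
    // shares blobs with the head object would also share the cached data,
    // but the in-memory blob object now needs its own lifecycle.
    struct Blob;  // would correspond to bluestore_blob_t's in-memory state
    using BlobExtentCache =
        std::map<std::pair<Blob*, uint64_t>, CachedBuffer>;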
>
> I'm inclined to stick with the current scheme since reads from just-cloned
> snaps probably aren't too frequent, at least until we have a better
> idea how to do the lifecycle with the simpler (current) model.
>
> 2. Second, we need to trim the buffers.  The per-collection onode cache is
> nice because the LRU is local to the collection and already protected by
> existing locks, which avoids complicated locking in the trim path that
> we'd get from a global LRU.  On the other hand, it's clearly suboptimal:
> some pools will get more IO than others and we want to apportion our cache
> resources more fairly.
>
> My hope is that we can do both using a scheme that has collection-local
> LRU (or ARC or some other cache policy) for onodes and buffers, and then
> have a global view of what proportion of the cache a collection is
> entitled to and drive our trimming against that.  This won't be super
> precise, but even if we fudge where the "perfect" per-collection cache
> size is by, say, 30%, I wouldn't expect to see a huge cache hit rate
> change over that range (unless your cache is already pretty
> undersized).
>
> Anyway, a simple way to estimate a collection's portion of the total might
> be to have a periodic global epoch counter that increments every, say, 10
> seconds.  Then we could count ops globally and per collection.  When the
> epoch rolls over, we look at our previous epoch count vs the global count
> and use that ratio to size our per-collection cache.  Since this doesn't
> have to be super-precise, we can get clever with atomics and per-cpu
> variables if we need to on the global count.
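A back-of-the-envelope version of that calibration (again just a sketch; the counters, the epoch hook, and the budget math are all assumptions):

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    std::atomic<uint64_t> g_ops{0};   // global op counter (could be per-cpu)

    struct CollectionCacheBudget {
      uint64_t my_ops = 0;            // ops this collection saw this epoch
      size_t budget = 0;              // bytes (or onodes) we may cache

      void note_op() {
        ++my_ops;
        g_ops.fetch_add(1, std::memory_order_relaxed);
      }

      // Called when the ~10s epoch rolls over; global_ops is the total op
      // count seen across all collections during the epoch that just ended.
      void on_epoch(uint64_t global_ops, size_t total_cache) {
        double share =
            global_ops ? double(my_ops) / double(global_ops) : 0.0;
        budget = size_t(share * double(total_cache));
        my_ops = 0;                   // start counting the next epoch
      }
    };

    // Trimming then just evicts from this collection's LRU until its
    // cached bytes fit within 'budget'.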
>
> Within a collection, we're under the same collection lock, so a simple LRU
> of onodes and buffers ought to suffice.  I think we want something better
> than just a straight-LRU, though: some data is hinted WILLNEED, and
> buffers we hit in cache twice should get bumped up higher than stuff we
> just read off of disk.  The MDS uses a simple 2-level LRU list; I suspect
> something like MQ might be a better choice for us, but this is probably a
> secondary issue; we can optimize it independently once we have the
> overall approach sorted out.
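As a strawman, a two-level list along those lines might look like the following (toy code, not the MDS implementation; the promotion rules are assumptions):

    #include <list>

    template <typename T>
    struct TwoLevelLRU {
      std::list<T> hot;   // WILLNEED-hinted or hit-twice entries
      std::list<T> cold;  // entries seen once (e.g. just read off disk)

      void insert(const T& item, bool willneed = false) {
        (willneed ? hot : cold).push_front(item);
      }

      // On a cache hit, promote to the hot list (O(n) here; a real
      // implementation would keep an iterator in the cached object).
      void touch(const T& item) {
        cold.remove(item);
        hot.remove(item);
        hot.push_front(item);
      }

      // Evict cold entries before hot ones.
      bool evict_one() {
        if (!cold.empty()) { cold.pop_back(); return true; }
        if (!hot.empty())  { hot.pop_back();  return true; }
        return false;
      }
    };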
>
> Anyway, I guess what I'm looking for is feedback on (1) above, and whether
> per-collection caches with periodic size calibration (based on workload)
> sounds reasonable.

Very good design: a sharded cache without additional locks that adapts to different workloads. One benefit logical extent caching lacks but blob caching has: if the data set's compression rate is high, wouldn't blob caching use less RAM? Also, users tend to use more PGs on SSD deployments; will per-collection caches use more CPU cycles with that many PGs?

Jianjian
>
> Thanks!
> sage
>