Previously we were relying on the block device cache when appropriate (when a rados hint indicated we should, usually), but that is unreliable for annoying O_DIRECT reasons. We also need an explicit cache to cope with some of the read/modify/write cases with compression and checksum block/chunk sizes.

A basic cache is in place that is attached to each Onode and maps the logical object bytes to buffers. We also have a per-collection onode cache. Currently the only trimming of data happens when onodes are trimmed, and we control that with a coarse per-collection num_onodes knob.

There are two basic questions:

1. Should we stick with a logical extent -> buffer mapping, or move to a blob -> blob extent mapping? The former is simpler and attaches to the onode cache, whose trimming we have to fix anyway. On the other hand, when we clone an object (usually a head object to a snapshot), the cache doesn't follow. A blob-based scheme would work better in that regard, but probably means we have another object (bluestore_blob_t) whose lifecycle we need to manage. I'm inclined to stick with the current scheme, since reads from just-cloned snaps probably aren't too frequent, at least until we have a better idea how to do the lifecycle with the simpler (current) model.

2. We need to trim the buffers. The per-collection onode cache is nice because the LRU is local to the collection and already protected by existing locks, which avoids the complicated locking in the trim path that we'd get from a global LRU. On the other hand, it's clearly suboptimal: some pools get more IO than others, and we want to apportion our cache resources more fairly. My hope is that we can do both using a scheme that has a collection-local LRU (or ARC, or some other cache policy) for onodes and buffers, plus a global view of what proportion of the cache each collection is entitled to, and drive our trimming against that.
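To make (1) concrete, here's a minimal sketch of the current scheme: a per-onode ordered map from logical object offset to cached buffer. The names (Buffer, Onode, lookup) are illustrative, not the actual BlueStore types; the point is just that the cache hangs off the onode, so a clone gets a fresh, cold map.

```cpp
// Illustrative sketch only; not the real BlueStore data structures.
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

struct Buffer {
  uint64_t offset;         // logical offset within the object
  std::vector<char> data;  // cached bytes
};

struct Onode {
  // logical extent -> buffer; cloning the object does NOT copy this map,
  // which is why a just-cloned snap starts with a cold cache.
  std::map<uint64_t, std::shared_ptr<Buffer>> buffer_map;

  void add_buffer(uint64_t off, std::vector<char> data) {
    buffer_map[off] =
        std::make_shared<Buffer>(Buffer{off, std::move(data)});
  }

  // Return the cached buffer covering 'off', if any.
  std::shared_ptr<Buffer> lookup(uint64_t off) {
    auto p = buffer_map.upper_bound(off);
    if (p == buffer_map.begin())
      return nullptr;
    --p;  // now at the last buffer starting at or before 'off'
    auto& b = p->second;
    if (off < b->offset + b->data.size())
      return b;
    return nullptr;
  }
};
```

A blob-based scheme would key the same buffers off a bluestore_blob_t shared between the head and the clone instead, which is what makes the cache survive a clone at the cost of a refcounted blob lifecycle.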
This won't be super precise, but even if we misjudge the "perfect" per-collection cache size by, say, 30%, I wouldn't expect to see a huge change in cache hit rate over that range (unless your cache is already pretty undersized). Anyway, a simple way to estimate a collection's portion of the total might be to have a periodic global epoch counter that increments every, say, 10 seconds. We would count ops globally and per collection. When the epoch rolls over, we look at our previous epoch count vs the global count and use that ratio to size our per-collection cache. Since this doesn't have to be super precise, we can get clever with atomics or per-cpu variables on the global count if we need to.

Within a collection we're under the same collection lock, so a simple LRU of onodes and buffers ought to suffice. I think we want something better than a straight LRU, though: some data is hinted WILLNEED, and buffers we hit in cache twice should get bumped up higher than stuff we just read off of disk. The MDS uses a simple 2-level LRU list; I suspect something like MQ might be a better choice for us, but this is probably a secondary issue.. we can optimize it independently once we have the overall approach sorted out.

Anyway, I guess what I'm looking for is feedback on (1) above, and on whether per-collection caches with periodic size calibration (based on workload) sound reasonable.

Thanks!
sage
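For what it's worth, the periodic calibration idea above could look roughly like the following sketch: ops are counted globally and per collection; at each epoch rollover a collection's trim target becomes its share of last epoch's ops scaled by the total cache budget. All names (GlobalCacheCtl, CollectionCache, roll_epoch, the floor) are hypothetical, and the 10-second timer driving roll_epoch() is not shown.

```cpp
// Hypothetical sketch of per-collection cache sizing by op-count ratio.
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>

struct GlobalCacheCtl {
  std::atomic<uint64_t> ops{0};  // ops counted in the current epoch
  uint64_t last_epoch_ops = 0;   // snapshot from the previous epoch
  uint64_t total_cache_bytes;    // global cache budget to apportion

  explicit GlobalCacheCtl(uint64_t total) : total_cache_bytes(total) {}

  // Called every ~10s by a timer (not shown): snapshot and reset.
  void roll_epoch() { last_epoch_ops = ops.exchange(0); }
};

struct CollectionCache {
  std::atomic<uint64_t> ops{0};
  uint64_t last_epoch_ops = 0;
  uint64_t target_bytes = 0;  // trim the local LRU down to this size

  void note_op(GlobalCacheCtl& g) {
    ops.fetch_add(1);
    g.ops.fetch_add(1);
  }

  // At rollover (after the global counter has rolled), take a share of
  // the budget proportional to our share of last epoch's ops, with a
  // small floor so idle collections keep a minimal cache.
  void roll_epoch(const GlobalCacheCtl& g, uint64_t floor_bytes) {
    last_epoch_ops = ops.exchange(0);
    if (g.last_epoch_ops == 0) {
      target_bytes = floor_bytes;
      return;
    }
    target_bytes = std::max<uint64_t>(
        floor_bytes,
        g.total_cache_bytes * last_epoch_ops / g.last_epoch_ops);
  }
};
```

The atomics keep the op counting lock-free; the trim itself would still run under the existing collection lock against the local LRU, so only the cheap counters are shared globally.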