Re: bluestore blobs REVISITED

Yikes. You mean that blob ids are escaping the environment of the lextent table? That's scary. What is the key for this cache? We probably need to invalidate it or something. 

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> 
> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>> In that case, we should focus instead on sharing the ref_map *only* and 
>>> always inlining the forward pointers for the blob.  This is closer to 
>>> what we were originally doing with the enode.  In fact, we could go back 
>>> to the enode approach, where it's just a big extent_ref_map used only to 
>>> defer deallocations until all refs are retired.  The blob is then more 
>>> ephemeral (local to the onode, an immutable copy if cloned), and we can 
>>> more easily rejigger how we store it.
>>> 
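(For concreteness, a rough C++ sketch of the deferred-deallocation idea
described above; the names are illustrative, not the actual
bluestore_extent_ref_map_t:)

    // Sketch of a "big extent_ref_map": count references to byte ranges
    // and report which ranges can be deallocated once the last ref is
    // retired.  Simplified: assumes ranges are always get/put at the
    // same boundaries.  Illustrative only, not BlueStore code.
    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    struct ExtentRefMap {
      // offset -> (length, refcount)
      std::map<uint64_t, std::pair<uint32_t, uint32_t>> refs;

      void get(uint64_t off, uint32_t len) {
        auto it = refs.find(off);
        if (it == refs.end())
          refs[off] = {len, 1};
        else
          ++it->second.second;
      }

      // Drop a reference; any range whose count hits zero is appended
      // to 'release' so the caller can defer the actual deallocation.
      void put(uint64_t off,
               std::vector<std::pair<uint64_t, uint32_t>> *release) {
        auto it = refs.find(off);
        if (it == refs.end())
          return;  // simplified: ignore unmatched puts
        if (--it->second.second == 0) {
          release->emplace_back(off, it->second.first);
          refs.erase(it);
        }
      }
    };
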
>>> We'd still have a "ref map" type structure for the blob, but it would 
>>> only be used for counting the lextents that reference it, and we could 
>>> build it dynamically when we load the extent map.  If we impose the 
>>> restriction that, whatever map sharding approach we take, we never share 
>>> a blob across a shard, then the blobs are always local and "ephemeral" 
>>> in the sense we've been talking about.  The only downside here, I think, 
>>> is that the write path needs to be smart enough not to create any new 
>>> blob that spans a boundary of the current map sharding (or, 
>>> alternatively, to trigger a resharding if it does so).
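(A hypothetical sketch of that write-path check -- made-up names, not
actual BlueStore code: clamp any new blob at the current shard
boundary, or fall back to a reshard:)

    // Never let a new blob cross the current extent-map shard boundary.
    // Returns the usable length for a new blob at 'offset'; 0 means the
    // caller has to reshard (or start the blob in the next shard).
    #include <algorithm>
    #include <cstdint>

    struct ShardBounds {
      uint64_t start;  // inclusive
      uint64_t end;    // exclusive
    };

    uint64_t clamp_new_blob(const ShardBounds &shard,
                            uint64_t offset, uint64_t want_len) {
      if (offset < shard.start || offset >= shard.end)
        return 0;  // outside this shard: reshard instead
      return std::min(want_len, shard.end - offset);
    }
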
>> 
>> Not just a resharding, but also a possible decompress/recompress cycle.
> 
> Yeah.
> 
> Oh, the other consequence of this is that we lose the unified blob-wise 
> cache behavior we added a while back.  That means that if you write a 
> bunch of data to an rbd data object, then clone it, then read from the 
> clone, it'll re-read the data from disk, because it'll be a different 
> blob in memory (since we'll be making a copy of the metadata, etc.).
> 
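(A rough sketch of why the clone goes cold, with illustrative types:
if buffers are cached per in-memory blob, a clone that copies the blob
metadata necessarily starts with an empty cache:)

    // Illustrative only: with a per-blob buffer cache, a cloned blob is
    // a fresh in-memory object and shares no cached data with its
    // source, even though it refers to the same bytes on disk.
    #include <cstdint>
    #include <map>
    #include <memory>
    #include <string>

    struct Blob {
      // offset -> cached data for this blob
      std::map<uint64_t, std::string> buffer_cache;
      // ...extents, csums, compression metadata elided...
    };

    std::shared_ptr<Blob> clone_blob(const Blob &src) {
      auto b = std::make_shared<Blob>();
      (void)src;  // metadata copy elided in this sketch
      // buffer_cache starts empty, so the first read of the clone has
      // to go back to disk even if the source blob's data is warm
      return b;
    }
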
> Josh, Jason, do you have a sense of whether that really matters?  The 
> common case is probably someone who creates a snapshot and then backs it 
> up, but it's going to be reading gobs of cold data off disk anyway, so I'm 
> guessing it doesn't matter that a bit of warm data that just preceded the 
> snapshot gets re-read.
> 
> sage
> 