Re: bluestore blobs REVISITED

You're suggesting a logical-address cache key (oid + offset) rather than a physical one (lba). That seems fine to me, provided that deletes and renames properly purge the cache.
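
For what it's worth, here's a minimal sketch of what I mean by a logically keyed cache with the delete/rename purge -- purely illustrative C++, not the actual BlueStore types or cache API:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <tuple>

    // Cache keyed by logical address (object id + offset) instead of lba.
    struct LogicalCacheKey {
      std::string oid;      // hypothetical stand-in for ghobject_t
      uint64_t offset;      // logical offset within the object
      bool operator<(const LogicalCacheKey &o) const {
        return std::tie(oid, offset) < std::tie(o.oid, o.offset);
      }
    };

    struct LogicalCache {
      std::map<LogicalCacheKey, std::string> buffers;  // key -> cached data

      // A delete has to drop every cached extent of the object, or a later
      // object reusing the name would see stale data.
      void purge_object(const std::string &oid) {
        auto p = buffers.lower_bound(LogicalCacheKey{oid, 0});
        while (p != buffers.end() && p->first.oid == oid)
          p = buffers.erase(p);
      }

      // Rename is the same problem: purge under the old name (re-inserting
      // under the new name is optional if we want to keep the data warm).
      void rename_object(const std::string &old_oid, const std::string &new_oid) {
        (void)new_oid;
        purge_object(old_oid);
      }
    };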

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 24, 2016, at 6:29 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> 
>> On Wed, 24 Aug 2016, Allen Samuels wrote:
>> Yikes. You mean that blob ids are escaping the environment of the 
>> lextent table. That's scary. What is the key for this cache? We probably 
>> need to invalidate it or something.
> 
> I mean that there will no longer be blob ids (except within the encoding 
> of a particular extent map shard).  Which means that when you write to A, 
> clone A->B, and then read B, B's blob will no longer be the same as A's 
> blob (as it is now in the bnode, or would have been with the -blobwise 
> branch) and the cache won't be preserved.
> 
> Which I *think* is okay...?
> 
> sage
> 
> 
>> 
>> Sent from my iPhone. Please excuse all typos and autocorrects.
>> 
>>> On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>> 
>>> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>>>> In that case, we should focus instead on sharing the ref_map *only* and 
>>>>> always inline the forward pointers for the blob.  This is closer to what 
>>>>> we were originally doing with the enode.  In fact, we could go back to the 
>>>>> enode approach where it's just a big extent_ref_map and only used to defer 
>>>>> deallocations until all refs are retired.  The blob is then more ephemeral 
>>>>> (local to the onode, immutable copy if cloned), and we can more easily 
>>>>> rejigger how we store it.
>>>>> 
>>>>> We'd still have a "ref map" type structure for the blob, but it would only 
>>>>> be used for counting the lextents that reference it, and we can 
>>>>> dynamically build it when we load the extent map.  If we impose the 
>>>>> restriction that, whatever map sharding approach we take, we never share 
>>>>> a blob across a shard, then the blobs are always local and "ephemeral" 
>>>>> in the sense we've been talking about.  The only downside here, I think, 
>>>>> is that the write path needs to be smart enough to not create any new blob 
>>>>> that spans whatever the current map sharding is (or, alternatively, 
>>>>> trigger a resharding if it does so).
>>>> 
>>>> Not just a resharding but also a possible decompress/recompress cycle.
>>> 
>>> Yeah.
>>> 
>>> Oh, the other consequence of this is that we lose the unified blob-wise 
>>> cache behavior we added a while back.  That means that if you write a 
>>> bunch of data to an rbd data object, then clone it, then read from the clone, 
>>> it'll re-read the data from disk.  Because it'll be a different blob in 
>>> memory (since we'll be making a copy of the metadata etc).
>>> 
>>> Josh, Jason, do you have a sense of whether that really matters?  The 
>>> common case is probably someone who creates a snapshot and then backs it 
>>> up, but it's going to be reading gobs of cold data off disk anyway so I'm 
>>> guessing it doesn't matter that a bit of warm data that just preceded the 
>>> snapshot gets re-read.
>>> 
>>> sage
>> 
>> 
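
For reference, a rough sketch of the ephemeral ref-count idea from the quoted discussion (rebuild per-blob reference counts when an extent map shard is loaded, never persist them). Names are illustrative only, not actual BlueStore code, and a plain counter stands in for the range-based extent_ref_map:

    #include <cstdint>
    #include <memory>
    #include <vector>

    struct Blob {
      // In-memory only; rebuilt on load, so nothing blob-id-like needs to
      // be encoded outside the shard.
      uint32_t num_refs = 0;
    };

    struct LExtent {
      uint64_t logical_offset = 0;
      uint64_t blob_offset = 0;
      uint64_t length = 0;
      std::shared_ptr<Blob> blob;   // shard-local: a blob never spans shards
    };

    // Called right after decoding one shard of the extent map.
    void build_refs(std::vector<LExtent> &shard) {
      for (auto &le : shard)
        ++le.blob->num_refs;
    }

    // Removing a lextent decrements; at zero the blob's space can be
    // released (or the deallocation deferred through a shared, enode-style
    // ref map if the blob had been cloned).
    void release_lextent(LExtent &le) {
      if (le.blob && --le.blob->num_refs == 0) {
        // free or defer deallocation of the blob's extents here
      }
    }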