Yes, a physically indexed cache solves that problem. But you will suffer
the translation overhead on a read hit - still probably the right choice.
(A sketch of this trade-off follows the quoted thread below.)

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 25, 2016, at 3:08 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>
>> On Thu, 25 Aug 2016, Sage Weil wrote:
>>> On Thu, 25 Aug 2016, Jason Dillaman wrote:
>>> Just so I understand, let's say a user snapshots an RBD image that has
>>> active IO. At this point, are you saying that the "A" data
>>> (pre-snapshot) is still (potentially) in the cache and any write
>>> op-induced creation of clone "B" would not be in the cache? If that's
>>> the case, it sounds like a re-read would be required after the first
>>> "post snapshot" write op.
>>
>> I mean you could have a sequence like
>>
>> write A 0~4096 to disk block X
>> clone A -> B
>> read A 0~4096 (cache hit, it's still there)
>> read B 0~4096 (cache miss, read disk block X. now 2 copies of X in ram)
>> read A 0~4096 (cache hit again, it's still there)
>>
>> The question is whether the "miss" reading B is concerning. Or the
>> double-caching, I suppose.
>
> You know what, I take it all back. We can uniquely identify blobs by
> their starting LBA, so there's no reason we can't unify the caches as
> before.
>
> sage
>
>
>
>>
>> sage
>>
>>
>>>
>>>> On Wed, Aug 24, 2016 at 6:29 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>>>> Yikes. You mean that blob ids are escaping the environment of the
>>>>> lextent table. That's scary. What is the key for this cache? We probably
>>>>> need to invalidate it or something.
>>>>
>>>> I mean that there will no longer be blob ids (except within the encoding
>>>> of a particular extent map shard). Which means that when you write to A,
>>>> clone A->B, and then read B, B's blob will no longer be the same as A's
>>>> blob (as it is now in the bnode, or would have been with the -blobwise
>>>> branch) and the cache won't be preserved.
>>>>
>>>> Which I *think* is okay...?
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Sent from my iPhone. Please excuse all typos and autocorrects.
>>>>>
>>>>>> On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>>>>>>> In that case, we should focus instead on sharing the ref_map *only* and
>>>>>>>> always inline the forward pointers for the blob. This is closer to what
>>>>>>>> we were originally doing with the enode. In fact, we could go back to the
>>>>>>>> enode approach where it's just a big extent_ref_map and only used to defer
>>>>>>>> deallocations until all refs are retired. The blob is then more ephemeral
>>>>>>>> (local to the onode, immutable copy if cloned), and we can more easily
>>>>>>>> rejigger how we store it.
>>>>>>>>
>>>>>>>> We'd still have a "ref map" type structure for the blob, but it would only
>>>>>>>> be used for counting the lextents that reference it, and we can
>>>>>>>> dynamically build it when we load the extent map. If we impose the
>>>>>>>> restriction that, whatever map sharding approach we take, we never share
>>>>>>>> a blob across a shard, then the blobs are always local and "ephemeral"
>>>>>>>> in the sense we've been talking about. The only downside here, I think,
>>>>>>>> is that the write path needs to be smart enough to not create any new blob
>>>>>>>> that spans whatever the current map sharding is (or, alternatively,
>>>>>>>> trigger a resharding if it does so).
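To make the shard-locality restriction quoted just above concrete, here is a
minimal C++ sketch of clipping a new blob at the extent-map shard boundary.
It is not the actual BlueStore write path; shard_end, clip_new_blob,
SHARD_SIZE, and MAX_BLOB_LEN are made-up names and sizes used only for
illustration.

  #include <algorithm>
  #include <cstdint>

  constexpr uint64_t SHARD_SIZE   = 1ull << 20;  // assumed 1 MB logical shards
  constexpr uint64_t MAX_BLOB_LEN = 1ull << 16;  // assumed 64 KB max blob size

  // End (exclusive) of the extent-map shard containing logical offset 'off'.
  inline uint64_t shard_end(uint64_t off) {
    return (off / SHARD_SIZE + 1) * SHARD_SIZE;
  }

  // Decide how long a new blob starting at 'off' may be so that it never
  // spans a shard boundary; 'want' is the length the writer would prefer.
  inline uint64_t clip_new_blob(uint64_t off, uint64_t want) {
    uint64_t room_in_shard = shard_end(off) - off;
    return std::min({want, MAX_BLOB_LEN, room_in_shard});  // never cross it
  }

If the wanted length exceeds the clipped length, the writer would either
start a second blob in the next shard or trigger a reshard, with the
possible decompress/recompress cycle noted just below.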
>>>>>>>
>>>>>>> Not just a resharding but also a possible decompress/recompress cycle.
>>>>>>
>>>>>> Yeah.
>>>>>>
>>>>>> Oh, the other consequence of this is that we lose the unified blob-wise
>>>>>> cache behavior we added a while back. That means that if you write a
>>>>>> bunch of data to an rbd data object, then clone it, then read off the
>>>>>> clone, it'll re-read the data from disk. Because it'll be a different
>>>>>> blob in memory (since we'll be making a copy of the metadata etc).
>>>>>>
>>>>>> Josh, Jason, do you have a sense of whether that really matters? The
>>>>>> common case is probably someone who creates a snapshot and then backs it
>>>>>> up, but it's going to be reading gobs of cold data off disk anyway, so I'm
>>>>>> guessing it doesn't matter that a bit of warm data that just preceded the
>>>>>> snapshot gets re-read.
>>>>>>
>>>>>> sage
>>>
>>>
>>> --
>>> Jason
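To make the physically indexed cache discussed at the top of this message
concrete, here is a minimal C++ sketch of keying cached buffers by the
blob's starting LBA, so that after clone A -> B a read of B resolves to the
same disk block X and hits A's cache entry instead of re-reading from disk
and double-caching. It is not the actual BlueStore buffer cache; BufferCache,
lba_t, and the toy data strings are invented for illustration.

  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <string>

  using lba_t = uint64_t;  // starting LBA of the blob's data on disk

  struct BufferCache {
    std::map<lba_t, std::string> buffers;  // physical index: LBA -> cached data

    bool lookup(lba_t lba, std::string *out) const {
      auto it = buffers.find(lba);
      if (it == buffers.end())
        return false;      // miss: caller must read the block from disk
      *out = it->second;   // hit: one entry serves A and any clone of A
      return true;
    }
    void insert(lba_t lba, std::string data) {
      buffers[lba] = std::move(data);
    }
  };

  int main() {
    BufferCache cache;

    // write A 0~4096: lands at disk block X (an arbitrary LBA here)
    const lba_t X = 0x1000;
    cache.insert(X, "data written to A");

    // clone A -> B: B gets its own copied blob metadata, but that blob still
    // points at starting LBA X, so the lookup key is unchanged.
    std::string buf;
    std::cout << "read A 0~4096: " << (cache.lookup(X, &buf) ? "hit" : "miss") << "\n";
    std::cout << "read B 0~4096: " << (cache.lookup(X, &buf) ? "hit" : "miss") << "\n";
    // Both print "hit": no re-read from disk, no second copy of X in RAM.
    return 0;
  }

The trade-off is the one noted at the top of this message: even a read hit
must first perform the logical-to-physical translation to obtain the LBA key
for the lookup.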