Just so I understand, let's say a user snapshots an RBD image that has
active IO. At this point, are you saying that the "A" data (pre-snapshot)
is still (potentially) in the cache, but the clone "B" created by any
write op would not be in the cache? If that's the case, it sounds like a
re-read would be required after the first post-snapshot write op.

On Wed, Aug 24, 2016 at 6:29 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Wed, 24 Aug 2016, Allen Samuels wrote:
>> Yikes. You mean that blob ids are escaping the environment of the
>> lextent table. That's scary. What is the key for this cache? We probably
>> need to invalidate it or something.
>
> I mean that there will no longer be blob ids (except within the encoding
> of a particular extent map shard). Which means that when you write to A,
> clone A->B, and then read B, B's blob will no longer be the same as A's
> blob (as it is now in the bnode, or would have been with the -blobwise
> branch) and the cache won't be preserved.
>
> Which I *think* is okay...?
>
> sage
>
>
>>
>> Sent from my iPhone. Please excuse all typos and autocorrects.
>>
>> > On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> >
>> > On Wed, 24 Aug 2016, Allen Samuels wrote:
>> >>> In that case, we should focus instead on sharing the ref_map *only*
>> >>> and always inline the forward pointers for the blob. This is closer
>> >>> to what we were originally doing with the enode. In fact, we could
>> >>> go back to the enode approach where it's just a big extent_ref_map
>> >>> and only used to defer deallocations until all refs are retired.
>> >>> The blob is then more ephemeral (local to the onode, immutable copy
>> >>> if cloned), and we can more easily rejigger how we store it.
>> >>>
>> >>> We'd still have a "ref map" type structure for the blob, but it
>> >>> would only be used for counting the lextents that reference it, and
>> >>> we can dynamically build it when we load the extent map. If we
>> >>> impose the restriction that, whatever map sharding approach we take,
>> >>> we never share a blob across a shard, then the blobs are always
>> >>> local and "ephemeral" in the sense we've been talking about. The
>> >>> only downside here, I think, is that the write path needs to be
>> >>> smart enough not to create any new blob that spans whatever the
>> >>> current map sharding is (or, alternatively, to trigger a resharding
>> >>> if it does).
>> >>
>> >> Not just a resharding but also a possible decompress/recompress cycle.
>> >
>> > Yeah.
>> >
>> > Oh, the other consequence of this is that we lose the unified
>> > blob-wise cache behavior we added a while back. That means that if
>> > you write a bunch of data to an rbd data object, then clone it, then
>> > read off the clone, it'll re-read the data from disk, because it'll
>> > be a different blob in memory (since we'll be making a copy of the
>> > metadata, etc.).
>> >
>> > Josh, Jason, do you have a sense of whether that really matters? The
>> > common case is probably someone who creates a snapshot and then backs
>> > it up, but that's going to be reading gobs of cold data off disk
>> > anyway, so I'm guessing it doesn't matter that a bit of warm data
>> > that just preceded the snapshot gets re-read.
>> >
>> > sage
>> >
>>

--
Jason
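
[Editor's note: to make the "big extent_ref_map" idea in the quoted
5:18 PM message easier to follow, here is a minimal, self-contained
sketch of deferring deallocation until all references are retired. The
type and member names (ExtentRefMap, get, put) are hypothetical and are
not the actual BlueStore structures; real Ceph code (e.g. around
bluestore_extent_ref_map_t) handles range splitting and much more.]

#include <cstdint>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

// Sketch: lextents take references on ranges of backing storage; a range
// only becomes eligible for deallocation once its last reference is gone.
struct ExtentRefMap {
  // offset -> (length, refcount) for referenced ranges.
  std::map<uint64_t, std::pair<uint64_t, uint32_t>> refs;

  // Take a reference on [offset, offset+length). For simplicity this
  // sketch assumes callers always use identical range boundaries.
  void get(uint64_t offset, uint64_t length) {
    auto it = refs.find(offset);
    if (it == refs.end())
      refs[offset] = {length, 1};
    else
      ++it->second.second;
  }

  // Drop a reference; ranges whose count hits zero are appended to
  // 'released' so the caller can defer the actual deallocation.
  void put(uint64_t offset, uint64_t length,
           std::vector<std::pair<uint64_t, uint64_t>>* released) {
    auto it = refs.find(offset);
    if (it == refs.end())
      return;  // untracked range; nothing to do in this sketch
    if (--it->second.second == 0) {
      released->push_back({offset, it->second.first});
      refs.erase(it);
    }
  }
};

int main() {
  ExtentRefMap ref_map;

  // Two lextents (e.g. one in object A and one in its clone B) reference
  // the same 64 KiB of backing storage.
  ref_map.get(0, 65536);
  ref_map.get(0, 65536);

  std::vector<std::pair<uint64_t, uint64_t>> released;

  // Overwriting the range in A drops one ref; nothing is freed yet
  // because B still references it.
  ref_map.put(0, 65536, &released);
  std::cout << "released after first put: " << released.size() << "\n";

  // Once B's reference is retired too, the range can be deallocated.
  ref_map.put(0, 65536, &released);
  std::cout << "released after second put: " << released.size() << "\n";
}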