On Thu, 25 Aug 2016, Sage Weil wrote:
> On Thu, 25 Aug 2016, Jason Dillaman wrote:
> > Just so I understand, let's say a user snapshots an RBD image that has
> > active IO. At this point, are you saying that the "A" data
> > (pre-snapshot) is still (potentially) in the cache and any write
> > op-induced creation of clone "B" would not be in the cache? If that's
> > the case, it sounds like a re-read would be required after the first
> > "post-snapshot" write op.
>
> I mean you could have a sequence like
>
>   write A 0~4096 to disk block X
>   clone A -> B
>   read A 0~4096 (cache hit, it's still there)
>   read B 0~4096 (cache miss, read disk block X; now 2 copies of X in RAM)
>   read A 0~4096 (cache hit again, it's still there)
>
> The question is whether the "miss" reading B is concerning. Or the
> double-caching, I suppose.

You know what, I take it all back. We can uniquely identify blobs by
their starting LBA, so there's no reason we can't unify the caches as
before.

sage
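To make the "unify the caches" point concrete, here is a minimal,
self-contained sketch of a buffer cache keyed by a blob's starting LBA
rather than by the in-memory blob object; with such a key, A and its
clone B resolve to the same cache entry, so the read of B in the sequence
above is a hit and nothing is double-cached. The names here
(UnifiedBufferCache, CachedBuffer, lookup, insert) are made up for
illustration and are not the actual BlueStore code.

  // Illustration only -- hypothetical names, not the real BlueStore types.
  #include <cstdint>
  #include <map>
  #include <memory>
  #include <string>

  struct CachedBuffer {
    uint64_t lba;      // starting LBA of the blob's data on disk
    std::string data;  // cached bytes
  };

  struct UnifiedBufferCache {
    // one entry per on-disk blob, no matter which onode or blob copy references it
    std::map<uint64_t, std::shared_ptr<CachedBuffer>> by_lba;

    std::shared_ptr<CachedBuffer> lookup(uint64_t lba) {
      auto it = by_lba.find(lba);
      return it == by_lba.end() ? nullptr : it->second;
    }

    void insert(uint64_t lba, std::string data) {
      by_lba[lba] = std::make_shared<CachedBuffer>(CachedBuffer{lba, std::move(data)});
    }
  };

  int main() {
    UnifiedBufferCache cache;
    const uint64_t X = 0x100000;               // "disk block X"
    cache.insert(X, std::string(4096, 'a'));   // write A 0~4096 to disk block X
    // clone A -> B: copies onode/blob metadata only; the cache is untouched
    auto a = cache.lookup(X);                  // read A 0~4096: hit
    auto b = cache.lookup(X);                  // read B 0~4096: hit too -- same entry, one copy in RAM
    return (a && b) ? 0 : 1;
  }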
>
> sage
>
> > On Wed, Aug 24, 2016 at 6:29 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > > On Wed, 24 Aug 2016, Allen Samuels wrote:
> > >> Yikes. You mean that blob ids are escaping the environment of the
> > >> lextent table. That's scary. What is the key for this cache? We probably
> > >> need to invalidate it or something.
> > >
> > > I mean that there will no longer be blob ids (except within the encoding
> > > of a particular extent map shard). Which means that when you write to A,
> > > clone A->B, and then read B, B's blob will no longer be the same as A's
> > > blob (as it is now in the bnode, or would have been with the -blobwise
> > > branch) and the cache won't be preserved.
> > >
> > > Which I *think* is okay...?
> > >
> > > sage
> > >
> > >> Sent from my iPhone. Please excuse all typos and autocorrects.
> > >>
> > >> > On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > >> >
> > >> > On Wed, 24 Aug 2016, Allen Samuels wrote:
> > >> >>> In that case, we should focus instead on sharing the ref_map *only* and
> > >> >>> always inline the forward pointers for the blob. This is closer to what
> > >> >>> we were originally doing with the enode. In fact, we could go back to the
> > >> >>> enode approach where it's just a big extent_ref_map and only used to defer
> > >> >>> deallocations until all refs are retired. The blob is then more ephemeral
> > >> >>> (local to the onode, an immutable copy if cloned), and we can more easily
> > >> >>> rejigger how we store it.
> > >> >>>
> > >> >>> We'd still have a "ref map" type structure for the blob, but it would only
> > >> >>> be used for counting the lextents that reference it, and we can
> > >> >>> dynamically build it when we load the extent map. If we impose the
> > >> >>> restriction that, whatever map sharding approach we take, we never share
> > >> >>> a blob across a shard, then the blobs are always local and "ephemeral"
> > >> >>> in the sense we've been talking about. The only downside here, I think,
> > >> >>> is that the write path needs to be smart enough not to create any new blob
> > >> >>> that spans whatever the current map sharding is (or, alternatively, to
> > >> >>> trigger a resharding if it does so).
> > >> >>
> > >> >> Not just a resharding but also a possible decompress/recompress cycle.
> > >> >
> > >> > Yeah.
> > >> >
> > >> > Oh, the other consequence of this is that we lose the unified blob-wise
> > >> > cache behavior we added a while back. That means that if you write a
> > >> > bunch of data to an rbd data object, then clone it, then read from the
> > >> > clone, it'll re-read the data from disk, because it'll be a different
> > >> > blob in memory (since we'll be making a copy of the metadata, etc.).
> > >> >
> > >> > Josh, Jason, do you have a sense of whether that really matters? The
> > >> > common case is probably someone who creates a snapshot and then backs it
> > >> > up, but it's going to be reading gobs of cold data off disk anyway, so I'm
> > >> > guessing it doesn't matter that a bit of warm data that just preceded the
> > >> > snapshot gets re-read.
> > >> >
> > >> > sage
> >
> > --
> > Jason
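For reference, the "ref map only" idea discussed in the quoted thread
above could look roughly like the following: a blob-local ref map that
just counts the lextents referencing each range, rebuilt when an extent
map shard is loaded, with deallocation deferred until a count drops to
zero. This is a simplified sketch under those assumptions; the types
(extent_ref_map_t, lextent_t, build_ref_maps as written here) are
stand-ins, not the real BlueStore structures.

  // Simplified stand-ins for illustration only -- not the real BlueStore types.
  #include <cstdint>
  #include <map>
  #include <utility>
  #include <vector>

  // Counts how many lextents still reference each range of a blob.  A range
  // is eligible for deallocation only when its count drops to zero.
  struct extent_ref_map_t {
    struct record_t { uint32_t length; uint32_t refs; };
    std::map<uint32_t, record_t> ref_map;   // blob offset -> (length, refcount)

    void get(uint32_t offset, uint32_t length) {
      auto it = ref_map.find(offset);
      if (it == ref_map.end())
        ref_map[offset] = record_t{length, 1};
      else
        ++it->second.refs;
    }

    // Drop one reference; return the ranges that can now be released.
    std::vector<std::pair<uint32_t, uint32_t>> put(uint32_t offset) {
      std::vector<std::pair<uint32_t, uint32_t>> release;
      auto it = ref_map.find(offset);
      if (it != ref_map.end() && --it->second.refs == 0) {
        release.emplace_back(offset, it->second.length);
        ref_map.erase(it);
      }
      return release;
    }
  };

  // One lextent entry within a single extent map shard; blob_id is only
  // meaningful inside that shard.
  struct lextent_t { int blob_id; uint32_t blob_offset; uint32_t length; };

  // Rebuild the per-blob ref maps from the lextents of one shard as it is
  // loaded, instead of persisting the ref counts separately.
  std::map<int, extent_ref_map_t> build_ref_maps(const std::vector<lextent_t>& shard) {
    std::map<int, extent_ref_map_t> maps;
    for (const auto& le : shard)
      maps[le.blob_id].get(le.blob_offset, le.length);
    return maps;
  }

In this model a clone only shares the deferred-deallocation bookkeeping;
the blobs themselves stay local to the onode (copied on clone), which is
the "ephemeral" behavior described in the quoted discussion.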