Yes, a physically indexed cache solves that problem. But you will suffer
the translation overhead on a read hit - still probably the right choice.
(A sketch of this trade-off follows the quoted thread below.)

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Aug 25, 2016, at 3:08 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>
>> On Thu, 25 Aug 2016, Sage Weil wrote:
>>> On Thu, 25 Aug 2016, Jason Dillaman wrote:
>>> Just so I understand, let's say a user snapshots an RBD image that has
>>> active IO. At this point, are you saying that the "A" data
>>> (pre-snapshot) is still (potentially) in the cache and any write
>>> op-induced creation of clone "B" would not be in the cache? If that's
>>> the case, it sounds like a re-read would be required after the first
>>> "post snapshot" write op.
>>
>> I mean you could have a sequence like
>>
>> write A 0~4096 to disk block X
>> clone A -> B
>> read A 0~4096 (cache hit, it's still there)
>> read B 0~4096 (cache miss, read disk block X. now 2 copies of X in ram)
>> read A 0~4096 (cache hit again, it's still there)
>>
>> The question is whether the "miss" reading B is concerning. Or the
>> double-caching, I suppose.
>
> You know what, I take it all back. We can uniquely identify blobs by
> their starting LBA, so there's no reason we can't unify the caches as
> before.
>
> sage
>
>
>
>>
>> sage
>>
>>
>>>
>>>> On Wed, Aug 24, 2016 at 6:29 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>>>> Yikes. You mean that blob ids are escaping the environment of the
>>>>> lextent table. That's scary. What is the key for this cache? We probably
>>>>> need to invalidate it or something.
>>>>
>>>> I mean that there will no longer be blob ids (except within the encoding
>>>> of a particular extent map shard). Which means that when you write to A,
>>>> clone A->B, and then read B, B's blob will no longer be the same as A's
>>>> blob (as it is now in the bnode, or would have been with the -blobwise
>>>> branch) and the cache won't be preserved.
>>>>
>>>> Which I *think* is okay...?
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Sent from my iPhone. Please excuse all typos and autocorrects.
>>>>>
>>>>>> On Aug 24, 2016, at 5:18 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Wed, 24 Aug 2016, Allen Samuels wrote:
>>>>>>>> In that case, we should focus instead on sharing the ref_map *only* and
>>>>>>>> always inline the forward pointers for the blob. This is closer to what
>>>>>>>> we were originally doing with the enode. In fact, we could go back to the
>>>>>>>> enode approach where it's just a big extent_ref_map and only used to defer
>>>>>>>> deallocations until all refs are retired. The blob is then more ephemeral
>>>>>>>> (local to the onode, immutable copy if cloned), and we can more easily
>>>>>>>> rejigger how we store it.
>>>>>>>>
>>>>>>>> We'd still have a "ref map" type structure for the blob, but it would only
>>>>>>>> be used for counting the lextents that reference it, and we can
>>>>>>>> dynamically build it when we load the extent map. If we impose the
>>>>>>>> restriction that, whatever map sharding approach we take, we never share
>>>>>>>> a blob across a shard, then the blobs are always local and "ephemeral"
>>>>>>>> in the sense we've been talking about. The only downside here, I think,
>>>>>>>> is that the write path needs to be smart enough to not create any new blob
>>>>>>>> that spans whatever the current map sharding is (or, alternatively,
>>>>>>>> trigger a resharding if it does so).
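To make the shard-locality restriction quoted just above concrete, here is a
minimal C++ sketch of clipping a new blob at the extent-map shard boundary.
It is not the actual BlueStore write path; shard_end, clip_new_blob,
SHARD_SIZE, and MAX_BLOB_LEN are made-up names and sizes used only for
illustration.

  #include <algorithm>
  #include <cstdint>

  constexpr uint64_t SHARD_SIZE   = 1ull << 20;  // assumed 1 MB logical shards
  constexpr uint64_t MAX_BLOB_LEN = 1ull << 16;  // assumed 64 KB max blob size

  // End (exclusive) of the extent-map shard containing logical offset 'off'.
  inline uint64_t shard_end(uint64_t off) {
    return (off / SHARD_SIZE + 1) * SHARD_SIZE;
  }

  // Decide how long a new blob starting at 'off' may be so that it never
  // spans a shard boundary; 'want' is the length the writer would prefer.
  inline uint64_t clip_new_blob(uint64_t off, uint64_t want) {
    uint64_t room_in_shard = shard_end(off) - off;
    return std::min({want, MAX_BLOB_LEN, room_in_shard});  // never cross it
  }

If the wanted length exceeds the clipped length, the writer would either
start a second blob in the next shard or trigger a reshard, with the
possible decompress/recompress cycle noted just below.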
>>>>>>>
>>>>>>> Not just a resharding but also a possible decompress/recompress cycle.
>>>>>>
>>>>>> Yeah.
>>>>>>
>>>>>> Oh, the other consequence of this is that we lose the unified blob-wise
>>>>>> cache behavior we added a while back. That means that if you write a
>>>>>> bunch of data to an rbd data object, then clone it, then read off the
>>>>>> clone, it'll re-read the data from disk. Because it'll be a different
>>>>>> blob in memory (since we'll be making a copy of the metadata etc).
>>>>>>
>>>>>> Josh, Jason, do you have a sense of whether that really matters? The
>>>>>> common case is probably someone who creates a snapshot and then backs it
>>>>>> up, but it's going to be reading gobs of cold data off disk anyway, so I'm
>>>>>> guessing it doesn't matter that a bit of warm data that just preceded the
>>>>>> snapshot gets re-read.
>>>>>>
>>>>>> sage
>>>
>>>
>>> --
>>> Jason
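To make the physically indexed cache discussed at the top of this message
concrete, here is a minimal C++ sketch of keying cached buffers by the
blob's starting LBA, so that after clone A -> B a read of B resolves to the
same disk block X and hits A's cache entry instead of re-reading from disk
and double-caching. It is not the actual BlueStore buffer cache; BufferCache,
lba_t, and the toy data strings are invented for illustration.

  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <string>

  using lba_t = uint64_t;  // starting LBA of the blob's data on disk

  struct BufferCache {
    std::map<lba_t, std::string> buffers;  // physical index: LBA -> cached data

    bool lookup(lba_t lba, std::string *out) const {
      auto it = buffers.find(lba);
      if (it == buffers.end())
        return false;      // miss: caller must read the block from disk
      *out = it->second;   // hit: one entry serves A and any clone of A
      return true;
    }
    void insert(lba_t lba, std::string data) {
      buffers[lba] = std::move(data);
    }
  };

  int main() {
    BufferCache cache;

    // write A 0~4096: lands at disk block X (an arbitrary LBA here)
    const lba_t X = 0x1000;
    cache.insert(X, "data written to A");

    // clone A -> B: B gets its own copied blob metadata, but that blob still
    // points at starting LBA X, so the lookup key is unchanged.
    std::string buf;
    std::cout << "read A 0~4096: " << (cache.lookup(X, &buf) ? "hit" : "miss") << "\n";
    std::cout << "read B 0~4096: " << (cache.lookup(X, &buf) ? "hit" : "miss") << "\n";
    // Both print "hit": no re-read from disk, no second copy of X in RAM.
    return 0;
  }

The trade-off is the one noted at the top of this message: even a read hit
must first perform the logical-to-physical translation to obtain the LBA key
for the lookup.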