On Wed, 24 Aug 2016, Allen Samuels wrote:
> > In that case, we should focus instead on sharing the ref_map *only* and
> > always inline the forward pointers for the blob.  This is closer to what
> > we were originally doing with the enode.  In fact, we could go back to the
> > enode approach where it's just a big extent_ref_map and only used to defer
> > deallocations until all refs are retired.  The blob is then more ephemeral
> > (local to the onode, immutable copy if cloned), and we can more easily
> > rejigger how we store it.
> >
> > We'd still have a "ref map" type structure for the blob, but it would only
> > be used for counting the lextents that reference it, and we can
> > dynamically build it when we load the extent map.  If we impose the
> > restriction that, whatever map sharding approach we take, we never share
> > a blob across a shard, then the blobs are always local and "ephemeral"
> > in the sense we've been talking about.  The only downside here, I think,
> > is that the write path needs to be smart enough to not create any new blob
> > that spans whatever the current map sharding is (or, alternatively,
> > trigger a resharding if it does so).
>
> Not just a resharding but also a possible decompress/recompress cycle.

Yeah.

Oh, the other consequence of this is that we lose the unified blob-wise
cache behavior we added a while back.  That means that if you write a
bunch of data to an rbd data object, then clone it, then read off the
clone, it'll re-read the data from disk, because it'll be a different blob
in memory (since we'll be making a copy of the metadata etc.).

Josh, Jason, do you have a sense of whether that really matters?  The
common case is probably someone who creates a snapshot and then backs it
up, but that's going to be reading gobs of cold data off disk anyway, so
I'm guessing it doesn't matter that a bit of warm data that just preceded
the snapshot gets re-read.

sage
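
For anyone following along, here is a minimal C++ sketch of the
deferred-deallocation idea the quoted text describes: a shared extent ref
map that only counts references to byte ranges of a blob and hands back
ranges whose count drops to zero, so the actual freeing can be deferred
until all refs are retired.  This is not the real BlueStore type; the name
simple_extent_ref_map_t and the get/put signatures are illustrative
assumptions, and it sidesteps range splitting/merging that a real
implementation would need.

  // Illustrative only -- not the actual BlueStore extent_ref_map.
  #include <cstdint>
  #include <map>
  #include <vector>
  #include <cassert>

  struct simple_extent_ref_map_t {
    // offset -> (length, refcount).  For simplicity callers must ref/unref
    // the same (offset, length) pairs they originally inserted.
    std::map<uint64_t, std::pair<uint64_t, uint32_t>> ref_map;

    // Take a reference on [offset, offset+length).
    void get(uint64_t offset, uint64_t length) {
      auto it = ref_map.find(offset);
      if (it == ref_map.end()) {
        ref_map[offset] = {length, 1};
      } else {
        assert(it->second.first == length);
        ++it->second.second;
      }
    }

    // Drop a reference; ranges whose count hits zero are appended to
    // *release so the caller can deallocate them later (deferred).
    void put(uint64_t offset, uint64_t length,
             std::vector<std::pair<uint64_t, uint64_t>>* release) {
      auto it = ref_map.find(offset);
      assert(it != ref_map.end() && it->second.first == length);
      if (--it->second.second == 0) {
        if (release)
          release->emplace_back(offset, length);
        ref_map.erase(it);
      }
    }

    bool empty() const { return ref_map.empty(); }
  };

The per-blob lextent count discussed above would not need to be persisted
at all in this scheme: it could be rebuilt by walking the lextents when an
extent map shard is loaded, which is what keeps the blob metadata local
and "ephemeral" as long as no blob crosses a shard boundary.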