On Mon, 8 Oct 2018, Mark Nelson wrote: > On 10/08/2018 07:44 AM, Sage Weil wrote: > > Hi Myoungwon, > > > > I had a good conversation last week with Corwin Coburn and Jered Floyd > > (two of the dedup experts who joined Red Hat last year as part of the > > Permabit acquisition) about distributed dedup in Ceph. > > > > The gist of our conversation is that the approach we're currently taking > > is the only one we've looked at that makes sense, but how effective it is > > really comes down to how well we can dedup based on a content-based > > chunking algorithm with largish chunks. Their product (VDO) was very > > effective because they chunked at very small granularity (4 KB blocks), > > which allowed them to get good dedup ratios, even on content that had > > limited similarities at the block layer, but such small chunk sizes aren't > > practical in a distributed pool like RADOS. In their experience (having > > tried many different architectural approaches over the last 10 years), > > dedup ratios fall off sharply as the chunk sizes increase. > > There was a video-streaming use case a while back where potentially the same > video data might be duplicated many times. Would those still be candidates > for coarse-grained dedup? Yes... the content-based chunking should work well even with very large chunk sizes for this use case. > > The main takeaway was that our big unknown is how well content-based > > chunking is going to work on real data sets--specifically, the kinds > > of data sets our users do/will have. The other thought was that it > > probably makes sense to look at RGW object workloads initially: given > > the read penalty we'll see on deduped objects, users are most likely > > to deploy this on colder object data sets. > > It fills me with a bit of dread. I guess the question I would have is whether > we actually expect there to be a lot of small object dedup benefit. I could > see if for (small) image hosting ala facebook I guess. What other use cases > did the Permabit guys target? VDO is pretty general-purpose, targetting a single block device on a single host. This lets them slot in dedup in cases where you wouldn't traditionally expect it: with flash, any regular block device that isn't super performance-sensitive could be deduped and with flash it will still perform reasonably. For us, deduping small images would work (they'd be one chunk), but there will be an additional layer of indirection to map names to content, and the storage savings are probably not worth it for a small data set that is performance sensitive. What I keep coming back to is that I expect that the bulk of the data stored is going to land in RGW (video, images, etc) or maybe CephFS and that the data landing in block is probably going to be (1) both a small fraction of the total data footprint and (2) performance sensitive. sage > > What I'd like to do soon/next is to write a tool that will enumerate rados > > objects in a given pool, read each object and apply a chunking algorithm > > (e.g., "rabin sliding window with ~128KB target object size"), and then > > calculate some stats by building a big in-memory unordered_set (or > > similar) of the hash values. This will allow us to run this on any > > cluster with an existing data set to estimate what type of dedup ratios we > > can expect. We can try it with different data sets, different chunk > > sizes, and eventually more sophisticated chunking algorithms (e.g., ones > > that chunk more intelligently based on recognized file formats, etc.). > > We can also make the tool extrapolate its results after scanning only a > > portion of the object namespace (say, only a specific range of PGs) since > > our objects are uniformately distributed across the pool hash space (and > > the tool needs to fit its big hash table of fingerprints in memory). > > > > What do you think? > > A tool that could look at an existing cluster and make a greedy and > interruptible estimate of potential dedup ratios at different granularity > levels is absolutely the right way to approach this imho. > >