On 10/08/2018 07:44 AM, Sage Weil wrote:
Hi Myoungwon, I had a good conversation last week with Corwin Coburn and Jered Floyd (two of the dedup experts who joined Red Hat last year as part of the Permabit acquisition) about distributed dedup in Ceph. The gist of our conversation is that the approach we're currently taking is the only one we've looked at that makes sense, but how effective it is really comes down to how well we can dedup based on a content-based chunking algorithm with largish chunks. Their product (VDO) was very effective because it chunked at very small granularity (4 KB blocks), which allowed them to get good dedup ratios even on content that had limited similarities at the block layer; such small chunk sizes aren't practical in a distributed pool like RADOS, though. In their experience (having tried many different architectural approaches over the last 10 years), dedup ratios fall off sharply as chunk sizes increase.
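[For concreteness, here is a minimal sketch of what a content-defined chunker along these lines might look like, using a simple sliding-window rolling hash as a stand-in for a real Rabin fingerprint. The window size, multiplier, and min/max chunk bounds are illustrative only, not anything VDO or Ceph actually uses.]

  // Minimal content-defined chunking sketch: a sliding-window rolling hash
  // declares a chunk boundary whenever the low bits of the hash are zero,
  // so boundaries depend on content rather than offsets.
  #include <cstdint>
  #include <cstddef>
  #include <vector>

  std::vector<size_t> cdc_chunk(const uint8_t* data, size_t len,
                                size_t avg_size /* power of two, e.g. 128<<10 */) {
    const size_t window = 48;               // bytes covered by the rolling hash
    const uint64_t base = 0x100000001b3ULL; // arbitrary odd multiplier
    const uint64_t mask = avg_size - 1;     // boundary when (hash & mask) == 0
    const size_t min_chunk = avg_size / 4, max_chunk = avg_size * 4;

    // Precompute base^window so the outgoing byte can be removed in O(1).
    uint64_t base_pow = 1;
    for (size_t i = 0; i < window; ++i) base_pow *= base;

    std::vector<size_t> chunks;
    uint64_t h = 0;
    size_t start = 0;
    for (size_t i = 0; i < len; ++i) {
      h = h * base + data[i];
      if (i >= start + window)
        h -= base_pow * data[i - window];   // slide the window forward
      size_t clen = i + 1 - start;
      if ((clen >= min_chunk && (h & mask) == 0) || clen >= max_chunk) {
        chunks.push_back(clen);
        start = i + 1;
        h = 0;
      }
    }
    if (start < len) chunks.push_back(len - start);  // trailing partial chunk
    return chunks;
  }

[Because boundaries are chosen by content, an insertion near the start of a stream only perturbs the chunks around it; the rest re-align and still dedup, which is why chunk size is the main knob that matters.]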
There was a video-streaming use case a while back where the same video data might potentially be duplicated many times. Would workloads like that still be candidates for coarse-grained dedup?
The main takeaway was that our big unknown is how well content-based chunking is going to work on real data sets--specifically, the kinds of data sets our users do/will have. The other thought was that it probably makes sense to look at RGW object workloads initially: given the read penalty we'll see on deduped objects, users are most likely to deploy this on colder object data sets.
It fills me with a bit of dread. I guess the question I would have is whether we actually expect there to be a lot of small-object dedup benefit. I could see it for (small) image hosting a la Facebook, I guess. What other use cases did the Permabit guys target?
What I'd like to do soon/next is to write a tool that will enumerate rados objects in a given pool, read each object and apply a chunking algorithm (e.g., "rabin sliding window with ~128KB target object size"), and then calculate some stats by building a big in-memory unordered_set (or similar) of the hash values. That will let us run it on any cluster with an existing data set to estimate what type of dedup ratios we can expect. We can try it with different data sets, different chunk sizes, and eventually more sophisticated chunking algorithms (e.g., ones that chunk more intelligently based on recognized file formats). We can also make the tool extrapolate its results after scanning only a portion of the object namespace (say, only a specific range of PGs), since our objects are uniformly distributed across the pool hash space (and the tool needs to fit its big hash table of fingerprints in memory). What do you think?
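[To make the shape of such a tool concrete, here is a rough sketch against the librados C++ API, reusing the cdc_chunk() helper sketched above. The pool name, chunk size, and use of std::hash as the fingerprint are placeholders (a real run would want a collision-resistant hash like SHA-256, since 64-bit collisions would inflate the apparent dedup ratio), and error handling is abbreviated.]

  #include <rados/librados.hpp>
  #include <unordered_set>
  #include <string_view>
  #include <vector>
  #include <ctime>
  #include <iostream>

  // Content-defined chunker from the earlier sketch.
  std::vector<size_t> cdc_chunk(const uint8_t* data, size_t len, size_t avg_size);

  int main(int argc, char** argv) {
    const char* pool = argc > 1 ? argv[1] : "rbd";   // placeholder pool name
    const size_t target_chunk = 128 * 1024;          // ~128KB average chunk

    librados::Rados cluster;
    cluster.init(nullptr);             // default client.admin identity
    cluster.conf_read_file(nullptr);   // default ceph.conf search path
    if (cluster.connect() < 0) {
      std::cerr << "failed to connect to cluster" << std::endl;
      return 1;
    }
    librados::IoCtx ioctx;
    if (cluster.ioctx_create(pool, ioctx) < 0) {
      std::cerr << "failed to open pool " << pool << std::endl;
      return 1;
    }

    std::unordered_set<uint64_t> fingerprints;  // the big in-memory hash table
    uint64_t logical_bytes = 0, unique_bytes = 0;

    for (auto it = ioctx.nobjects_begin(); it != ioctx.nobjects_end(); ++it) {
      uint64_t size = 0;
      time_t mtime = 0;
      if (ioctx.stat(it->get_oid(), &size, &mtime) < 0 || size == 0)
        continue;
      librados::bufferlist bl;
      if (ioctx.read(it->get_oid(), bl, size, 0) < 0)  // whole object, for simplicity
        continue;

      const char* p = bl.c_str();
      size_t off = 0;
      for (size_t clen : cdc_chunk(reinterpret_cast<const uint8_t*>(p),
                                   bl.length(), target_chunk)) {
        uint64_t fp = std::hash<std::string_view>{}(std::string_view(p + off, clen));
        if (fingerprints.insert(fp).second)
          unique_bytes += clen;   // first time we've seen this chunk
        logical_bytes += clen;
        off += clen;
      }
    }

    std::cout << "logical bytes: " << logical_bytes << "\n"
              << "unique bytes:  " << unique_bytes << "\n"
              << "estimated dedup ratio: "
              << (unique_bytes ? double(logical_bytes) / unique_bytes : 0.0)
              << std::endl;
  }

[Reading each object in full keeps the sketch simple; a real tool would probably read large objects in pieces, and could restrict the scan to objects whose names hash into a chosen slice of the pool (the PG-range sampling mentioned above) to bound memory and runtime before extrapolating.]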
A tool that could look at an existing cluster and make a greedy and interruptible estimate of potential dedup ratios at different granularity levels is absolutely the right way to approach this imho.
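[As a sketch of what "different granularity levels" could look like in that tool, one option is to keep a separate fingerprint table per target chunk size and feed every object through each of them in a single pass. The struct and function names below are made up for illustration and build on the cdc_chunk() helper sketched earlier.]

  #include <cstdint>
  #include <cstddef>
  #include <string_view>
  #include <unordered_set>
  #include <vector>

  // Chunker from the earlier sketch.
  std::vector<size_t> cdc_chunk(const uint8_t* data, size_t len, size_t avg_size);

  // One fingerprint table per target chunk size, so a single scan yields an
  // estimated dedup ratio for each granularity (names are illustrative).
  struct GranularityStats {
    size_t target_chunk;
    std::unordered_set<uint64_t> fingerprints;
    uint64_t logical_bytes = 0;
    uint64_t unique_bytes = 0;
    double ratio() const {
      return unique_bytes ? double(logical_bytes) / unique_bytes : 0.0;
    }
  };

  // Called once per object; because the counters are cumulative, the scan can
  // be interrupted at any point and still report a partial (greedy) estimate.
  void account_object(const uint8_t* data, size_t len,
                      std::vector<GranularityStats>& stats) {
    for (auto& s : stats) {
      size_t off = 0;
      for (size_t clen : cdc_chunk(data, len, s.target_chunk)) {
        uint64_t fp = std::hash<std::string_view>{}(
            std::string_view(reinterpret_cast<const char*>(data) + off, clen));
        if (s.fingerprints.insert(fp).second)
          s.unique_bytes += clen;
        s.logical_bytes += clen;
        off += clen;
      }
    }
  }

[Populating the vector with, say, 32KB, 128KB, and 512KB targets would give one ratio per granularity from a single scan, at the cost of one fingerprint table per target size in memory.]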