dedup next steps

Hi Myoungwon,

I had a good conversation last week with Corwin Coburn and Jered Floyd 
(two of the dedup experts who joined Red Hat last year as part of the 
Permabit acquisition) about distributed dedup in Ceph.

The gist of our conversation is that the approach we're currently taking 
is the only one we've looked at that makes sense, but how effective it is 
really comes down to how well we can dedup based on a content-based 
chunking algorithm with largish chunks.  Their product (VDO) was very 
effective because they chunked at very small granularity (4 KB blocks), 
which allowed them to get good dedup ratios, even on content that had 
limited similarities at the block layer, but such small chunk sizes aren't 
practical in a distributed pool like RADOS.  In their experience (having 
tried many different architectural approaches over the last 10 years), 
dedup ratios fall off sharply as the chunk sizes increase.

The main takeaway was that our big unknown is how well content-based 
chunking is going to work on real data sets--specifically, the kinds of 
data sets our users do/will have.  The other thought was that it probably 
makes sense to look at RGW object workloads initially: given the read 
penalty we'll see on deduped objects, users are most likely to deploy this 
on colder object data sets.

What I'd like to do soon/next is to write a tool that will enumerate rados 
objects in a given pool, read each object and apply a chunking algorithm 
(e.g., "rabin sliding window with ~128KB target object size"), and then 
calculate some stats by building a big in-memory unordered_set (or 
similar) of the hash values.  This will allow us to run this on any 
cluster with an existing data set to estimate what type of dedup ratios we 
can expect.  We can try it with different data sets, different chunk 
sizes, and eventually more sophisticated chunking algorithms (e.g., ones 
that chunk more intelligently based on recognized file formats, etc.).  
We can also make the tool extrapolate its results after scanning only a 
portion of the object namespace (say, only a specific range of PGs) since 
our objects are uniformly distributed across the pool hash space (and 
the tool needs to fit its big hash table of fingerprints in memory).
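
To make that concrete, here's a rough sketch of what such an estimator 
could look like in Python with the python-rados bindings (a real tool 
would probably end up in C++ next to the other rados utilities).  The 
pool name, conffile path, window size, chunk-size parameters, and sha256 
fingerprint below are placeholders, the rolling hash is a simple 
polynomial stand-in for a proper rabin fingerprint, and "scan only part 
of the namespace" is approximated by sampling object names rather than 
walking an actual PG range:

#!/usr/bin/env python3
#
# Sketch of a dedup-ratio estimator: enumerate the objects in a pool,
# cut each one into content-defined chunks with a rolling hash,
# fingerprint each chunk, and keep the fingerprints in an in-memory set.
# All parameters and paths below are placeholders.

import hashlib
import sys

import rados                        # python-rados bindings shipped with ceph

CONF = '/etc/ceph/ceph.conf'        # assumption: default conf location
POOL = sys.argv[1] if len(sys.argv) > 1 else 'rbd'

WINDOW = 48                         # rolling-hash window, bytes
AVG_CHUNK = 128 * 1024              # ~128KB target chunk size
MIN_CHUNK = 32 * 1024
MAX_CHUNK = 512 * 1024
CUT_MASK = AVG_CHUNK - 1            # cut when (h & CUT_MASK) == CUT_MASK
PRIME = 1099511628211               # odd multiplier for the toy hash
SAMPLE = 16                         # look at roughly 1/16 of the objects


def chunks(data):
    """Yield content-defined chunks of data (toy polynomial rolling hash,
    not a real rabin fingerprint; slow in pure Python but fine for an
    estimate on a sample)."""
    mask64 = (1 << 64) - 1
    pow_out = pow(PRIME, WINDOW - 1, 1 << 64)  # weight of the departing byte
    window = bytearray()
    h = 0
    start = 0
    for i, b in enumerate(data):
        if len(window) == WINDOW:
            h = (h - window.pop(0) * pow_out) & mask64
        window.append(b)
        h = (h * PRIME + b) & mask64
        size = i + 1 - start
        if (size >= MIN_CHUNK and (h & CUT_MASK) == CUT_MASK) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            window.clear()
            h = 0
    if start < len(data):
        yield data[start:]


def main():
    cluster = rados.Rados(conffile=CONF)
    cluster.connect()
    ioctx = cluster.open_ioctx(POOL)

    seen = set()                    # the big in-memory fingerprint set
    total = unique = nobj = 0
    for obj in ioctx.list_objects():
        # Crude stand-in for "scan only a range of PGs": sample a fixed
        # fraction of the objects and extrapolate.
        if hash(obj.key) % SAMPLE != 0:
            continue
        nobj += 1
        size, _ = ioctx.stat(obj.key)
        data = ioctx.read(obj.key, length=size)
        for c in chunks(data):
            total += len(c)
            fp = hashlib.sha256(c).digest()
            if fp not in seen:
                seen.add(fp)
                unique += len(c)

    print('objects scanned: %d (~1/%d of the pool)' % (nobj, SAMPLE))
    print('logical bytes:   %d' % total)
    print('unique bytes:    %d' % unique)
    if unique:
        print('dedup ratio:     %.2fx' % (total / unique))

    ioctx.close()
    cluster.shutdown()


if __name__ == '__main__':
    main()

Trying different chunk sizes or a smarter, format-aware chunker would 
then just be a matter of swapping out chunks().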

What do you think?
sage


