On 10/08/2018 07:44 AM, Sage Weil wrote:
Hi Myoungwon, I had a good conversation last week with Corwin Coburn and Jered Floyd (two of the dedup experts who joined Red Hat last year as part of the Permabit acquisition) about distributed dedup in Ceph. The gist of our conversation is that the approach we're currently taking is the only one we've looked at that makes sense, but how effective it is really comes down to how well we can dedup based on a content-based chunking algorithm with largish chunks. Their product (VDO) was very effective because it chunked at very small granularity (4 KB blocks), which allowed them to get good dedup ratios even on content that had limited similarities at the block layer; such small chunk sizes aren't practical in a distributed pool like RADOS, though. In their experience (having tried many different architectural approaches over the last 10 years), dedup ratios fall off sharply as chunk sizes increase.
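[For concreteness, here is a minimal sketch of what a content-defined chunker along these lines might look like, using a simple sliding-window rolling hash as a stand-in for a real Rabin fingerprint. The window size, multiplier, and min/max chunk bounds are illustrative only, not anything VDO or Ceph actually uses.]

  // Minimal content-defined chunking sketch: a sliding-window rolling hash
  // declares a chunk boundary whenever the low bits of the hash are zero,
  // so boundaries depend on content rather than offsets.
  #include <cstdint>
  #include <cstddef>
  #include <vector>

  std::vector<size_t> cdc_chunk(const uint8_t* data, size_t len,
                                size_t avg_size /* power of two, e.g. 128<<10 */) {
    const size_t window = 48;               // bytes covered by the rolling hash
    const uint64_t base = 0x100000001b3ULL; // arbitrary odd multiplier
    const uint64_t mask = avg_size - 1;     // boundary when (hash & mask) == 0
    const size_t min_chunk = avg_size / 4, max_chunk = avg_size * 4;

    // Precompute base^window so the outgoing byte can be removed in O(1).
    uint64_t base_pow = 1;
    for (size_t i = 0; i < window; ++i) base_pow *= base;

    std::vector<size_t> chunks;
    uint64_t h = 0;
    size_t start = 0;
    for (size_t i = 0; i < len; ++i) {
      h = h * base + data[i];
      if (i >= start + window)
        h -= base_pow * data[i - window];   // slide the window forward
      size_t clen = i + 1 - start;
      if ((clen >= min_chunk && (h & mask) == 0) || clen >= max_chunk) {
        chunks.push_back(clen);
        start = i + 1;
        h = 0;
      }
    }
    if (start < len) chunks.push_back(len - start);  // trailing partial chunk
    return chunks;
  }

[Because boundaries are chosen by content, an insertion near the start of a stream only perturbs the chunks around it; the rest re-align and still dedup, which is why chunk size is the main knob that matters.]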
There was a video-streaming use case a while back where the same video data might potentially be duplicated many times. Would workloads like that still be candidates for coarse-grained dedup?
The main takeaway was that our big unknown is how well content-based chunking is going to work on real data sets--specifically, the kinds of data sets our users do/will have. The other thought was that it probably makes sense to look at RGW object workloads initially: given the read penalty we'll see on deduped objects, users are most likely to deploy this on colder object data sets.
It fills me with a bit of dread. I guess the question I would have is whether we actually expect there to be a lot of small-object dedup benefit. I could see it for (small) image hosting a la Facebook, I guess. What other use cases did the Permabit guys target?
What I'd like to do soon/next is to write a tool that will enumerate rados objects in a given pool, read each object and apply a chunking algorithm (e.g., "rabin sliding window with ~128KB target object size"), and then calculate some stats by building a big in-memory unordered_set (or similar) of the hash values. That will let us run it on any cluster with an existing data set to estimate what type of dedup ratios we can expect. We can try it with different data sets, different chunk sizes, and eventually more sophisticated chunking algorithms (e.g., ones that chunk more intelligently based on recognized file formats). We can also make the tool extrapolate its results after scanning only a portion of the object namespace (say, only a specific range of PGs), since our objects are uniformly distributed across the pool hash space (and the tool needs to fit its big hash table of fingerprints in memory). What do you think?
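[To make the shape of such a tool concrete, here is a rough sketch against the librados C++ API, reusing the cdc_chunk() helper sketched above. The pool name, chunk size, and use of std::hash as the fingerprint are placeholders (a real run would want a collision-resistant hash like SHA-256, since 64-bit collisions would inflate the apparent dedup ratio), and error handling is abbreviated.]

  #include <rados/librados.hpp>
  #include <unordered_set>
  #include <string_view>
  #include <vector>
  #include <ctime>
  #include <iostream>

  // Content-defined chunker from the earlier sketch.
  std::vector<size_t> cdc_chunk(const uint8_t* data, size_t len, size_t avg_size);

  int main(int argc, char** argv) {
    const char* pool = argc > 1 ? argv[1] : "rbd";   // placeholder pool name
    const size_t target_chunk = 128 * 1024;          // ~128KB average chunk

    librados::Rados cluster;
    cluster.init(nullptr);             // default client.admin identity
    cluster.conf_read_file(nullptr);   // default ceph.conf search path
    if (cluster.connect() < 0) {
      std::cerr << "failed to connect to cluster" << std::endl;
      return 1;
    }
    librados::IoCtx ioctx;
    if (cluster.ioctx_create(pool, ioctx) < 0) {
      std::cerr << "failed to open pool " << pool << std::endl;
      return 1;
    }

    std::unordered_set<uint64_t> fingerprints;  // the big in-memory hash table
    uint64_t logical_bytes = 0, unique_bytes = 0;

    for (auto it = ioctx.nobjects_begin(); it != ioctx.nobjects_end(); ++it) {
      uint64_t size = 0;
      time_t mtime = 0;
      if (ioctx.stat(it->get_oid(), &size, &mtime) < 0 || size == 0)
        continue;
      librados::bufferlist bl;
      if (ioctx.read(it->get_oid(), bl, size, 0) < 0)  // whole object, for simplicity
        continue;

      const char* p = bl.c_str();
      size_t off = 0;
      for (size_t clen : cdc_chunk(reinterpret_cast<const uint8_t*>(p),
                                   bl.length(), target_chunk)) {
        uint64_t fp = std::hash<std::string_view>{}(std::string_view(p + off, clen));
        if (fingerprints.insert(fp).second)
          unique_bytes += clen;   // first time we've seen this chunk
        logical_bytes += clen;
        off += clen;
      }
    }

    std::cout << "logical bytes: " << logical_bytes << "\n"
              << "unique bytes:  " << unique_bytes << "\n"
              << "estimated dedup ratio: "
              << (unique_bytes ? double(logical_bytes) / unique_bytes : 0.0)
              << std::endl;
  }

[Reading each object in full keeps the sketch simple; a real tool would probably read large objects in pieces, and could restrict the scan to objects whose names hash into a chosen slice of the pool (the PG-range sampling mentioned above) to bound memory and runtime before extrapolating.]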
A tool that could look at an existing cluster and make a greedy and interruptible estimate of potential dedup ratios at different granularity levels is absolutely the right way to approach this imho.
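[As a sketch of what "different granularity levels" could look like in that tool, one option is to keep a separate fingerprint table per target chunk size and feed every object through each of them in a single pass. The struct and function names below are made up for illustration and build on the cdc_chunk() helper sketched earlier.]

  #include <cstdint>
  #include <cstddef>
  #include <string_view>
  #include <unordered_set>
  #include <vector>

  // Chunker from the earlier sketch.
  std::vector<size_t> cdc_chunk(const uint8_t* data, size_t len, size_t avg_size);

  // One fingerprint table per target chunk size, so a single scan yields an
  // estimated dedup ratio for each granularity (names are illustrative).
  struct GranularityStats {
    size_t target_chunk;
    std::unordered_set<uint64_t> fingerprints;
    uint64_t logical_bytes = 0;
    uint64_t unique_bytes = 0;
    double ratio() const {
      return unique_bytes ? double(logical_bytes) / unique_bytes : 0.0;
    }
  };

  // Called once per object; because the counters are cumulative, the scan can
  // be interrupted at any point and still report a partial (greedy) estimate.
  void account_object(const uint8_t* data, size_t len,
                      std::vector<GranularityStats>& stats) {
    for (auto& s : stats) {
      size_t off = 0;
      for (size_t clen : cdc_chunk(data, len, s.target_chunk)) {
        uint64_t fp = std::hash<std::string_view>{}(
            std::string_view(reinterpret_cast<const char*>(data) + off, clen));
        if (s.fingerprints.insert(fp).second)
          s.unique_bytes += clen;
        s.logical_bytes += clen;
        off += clen;
      }
    }
  }

[Populating the vector with, say, 32KB, 128KB, and 512KB targets would give one ratio per granularity from a single scan, at the cost of one fingerprint table per target size in memory.]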