Re: dedup next steps

On 10/8/18 9:28 AM, Sage Weil wrote:
On Mon, 8 Oct 2018, Mark Nelson wrote:
On 10/08/2018 07:44 AM, Sage Weil wrote:
Hi Myoungwon,

I had a good conversation last week with Corwin Coburn and Jered Floyd
(two of the dedup experts who joined Red Hat last year as part of the
Permabit acquisition) about distributed dedup in Ceph.

The gist of our conversation is that the approach we're currently taking
is the only one we've looked at that makes sense, but how effective it is
really comes down to how well we can dedup based on a content-based
chunking algorithm with largish chunks.  Their product (VDO) was very
effective because they chunked at very small granularity (4 KB blocks),
which allowed them to get good dedup ratios, even on content that had
limited similarities at the block layer, but such small chunk sizes aren't
practical in a distributed pool like RADOS.  In their experience (having
tried many different architectural approaches over the last 10 years),
dedup ratios fall off sharply as the chunk sizes increase.
There was a video-streaming use case a while back where potentially the same
video data might be duplicated many times.  Would those still be candidates
for coarse-grained dedup?
Yes... the content-based chunking should work well even with very
large chunk sizes for this use case.
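
To make that concrete, here is a toy sketch (Python; the rolling hash, mask,
and sizes are made-up stand-ins for a real Rabin sliding window, not an
actual implementation) showing how content-defined boundaries re-align on
shared data even when it sits at different offsets:

# Toy content-defined chunker: a rolling hash over the trailing bytes
# decides where each chunk ends, so identical content yields identical
# chunk boundaries regardless of the offset it appears at.
import hashlib
import os

MASK = (1 << 17) - 1   # boundary when the low 17 hash bits are zero -> ~128KB average chunk
MIN_CHUNK = 1024       # skip pathologically small chunks

def chunks(data):
    h = 0
    start = 0
    for i in range(len(data)):
        # shift-and-add rolling hash; old bytes fall out of the 32-bit window
        h = ((h << 1) + data[i]) & 0xffffffff
        if i - start >= MIN_CHUNK and (h & MASK) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

# Two "objects" that share 2MB of content at different offsets:
shared = os.urandom(2 * 1024 * 1024)
a = os.urandom(100) + shared
b = os.urandom(5000) + shared
fa = {hashlib.sha256(c).digest() for c in chunks(a)}
fb = {hashlib.sha256(c).digest() for c in chunks(b)}
print('chunks in a: %d, shared with b: %d' % (len(fa), len(fa & fb)))

With fixed-size chunking the different prefix lengths (100 vs 5000 bytes)
would shift every boundary and almost nothing would match; with
content-defined boundaries everything after the first boundary in the
shared region lines up again.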


There is a tension in the world of dedup between large chunks, where every byte has to be the same, and small chunks, which cost more work but can dedup part of the data even without a full match.

It would be interesting to see, for real-world data, how much VDO can do with small chunks just sitting under a single OSD...

ric




The main takeaway was that our big unknown is how well content-based
chunking is going to work on real data sets--specifically, the kinds
of data sets our users do/will have.  The other thought was that it
probably makes sense to look at RGW object workloads initially: given
the read penalty we'll see on deduped objects, users are most likely
to deploy this on colder object data sets.
It fills me with a bit of dread.  I guess the question I would have is whether
we actually expect there to be a lot of small-object dedup benefit.  I could
see it for (small) image hosting a la Facebook, I guess.  What other use cases
did the Permabit guys target?
VDO is pretty general-purpose, targeting a single block device on a
single host.  This lets them slot in dedup in cases where you wouldn't
traditionally expect it: with flash, any regular block device that isn't
super performance-sensitive can be deduped and will still perform
reasonably.

For us, deduping small images would work (they'd be one chunk), but
there will be an additional layer of indirection to map names to
content, and the storage savings are probably not worth it for a small
data set that is performance sensitive.
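
Roughly speaking, the indirection looks something like this (purely
illustrative structures, not the actual manifest/chunk-pool layout):

# Illustrative only: the extra hop a deduped read takes.  A small image is
# a single chunk, but its name now maps to a fingerprint, and the
# fingerprint maps to the (shared) chunk data.
import hashlib

chunk_pool = {}   # fingerprint -> chunk bytes, stored once no matter how many names point at it
name_index = {}   # object name -> list of chunk fingerprints (a "manifest")

def write(name, data):
    fp = hashlib.sha256(data).digest()
    chunk_pool.setdefault(fp, data)
    name_index[name] = [fp]

def read(name):
    return b''.join(chunk_pool[fp] for fp in name_index[name])

write('avatar-1.png', b'same image bytes')
write('avatar-2.png', b'same image bytes')
print(len(chunk_pool))                               # 1: stored once
print(read('avatar-1.png') == read('avatar-2.png'))  # True, but each read is two lookups

Every read becomes two lookups (name -> fingerprints, fingerprint ->
chunk), which is the read penalty mentioned above.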

What I keep coming back to is that I expect that the bulk of the data
stored is going to land in RGW (video, images, etc) or maybe CephFS and
that the data landing in block is probably going to be (1) both a small
fraction of the total data footprint and (2) performance sensitive.

sage


What I'd like to do soon/next is to write a tool that will enumerate rados
objects in a given pool, read each object and apply a chunking algorithm
(e.g., "rabin sliding window with ~128KB target object size"), and then
calculate some stats by building a big in-memory unordered_set (or
similar) of the hash values.  This will allow us to run this on any
cluster with an existing data set to estimate what type of dedup ratios we
can expect.  We can try it with different data sets, different chunk
sizes, and eventually more sophisticated chunking algorithms (e.g., ones
that chunk more intelligently based on recognized file formats, etc.).
We can also make the tool extrapolate its results after scanning only a
portion of the object namespace (say, only a specific range of PGs) since
our objects are uniformly distributed across the pool hash space (and
the tool needs to fit its big hash table of fingerprints in memory).
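
A rough sketch of what such a scanner could look like, using the
python-rados bindings, with naive fixed-size chunking standing in for the
Rabin chunker (the pool name, chunk size, and per-object read cap are
arbitrary placeholders, and the scan-a-slice-and-extrapolate part is left
out):

#!/usr/bin/env python3
# Sketch of a dedup-ratio estimator: enumerate the objects in a pool,
# chunk each one, fingerprint the chunks in an in-memory set, and report
# total vs. unique bytes.  Fixed-size chunking stands in for the Rabin
# sliding window here.
import hashlib
import rados

POOL = 'mypool'                # illustrative pool name
CHUNK_SIZE = 128 * 1024        # ~128KB chunks
MAX_READ = 100 * 1024 * 1024   # read at most 100MB per object in this sketch

def chunk(data, size=CHUNK_SIZE):
    for off in range(0, len(data), size):
        yield data[off:off + size]

def main():
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx(POOL)
    seen = set()          # chunk fingerprints we've already counted
    total = unique = 0
    try:
        for obj in ioctx.list_objects():
            data = obj.read(MAX_READ)
            for c in chunk(data):
                fp = hashlib.sha256(c).digest()
                total += len(c)
                if fp not in seen:
                    seen.add(fp)
                    unique += len(c)
    finally:
        ioctx.close()
        cluster.shutdown()
    if unique:
        print('total %d bytes, unique %d bytes, dedup ratio %.2f'
              % (total, unique, float(total) / unique))

if __name__ == '__main__':
    main()

Swapping the chunk() helper for a content-defined chunker and adding the
hash-slice extrapolation would get it closer to the tool described above.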

What do you think?
A tool that could look at an existing cluster and make a greedy and
interruptible estimate of potential dedup ratios at different granularity
levels is absolutely the right way to approach this imho.




