Re: dedup next steps

On Tue, Oct 9, 2018 at 12:19 AM Ric Wheeler <rwheeler@xxxxxxxxxx> wrote:
>
> On 10/8/18 9:28 AM, Sage Weil wrote:
> > On Mon, 8 Oct 2018, Mark Nelson wrote:
> >> On 10/08/2018 07:44 AM, Sage Weil wrote:
> >>> Hi Myoungwon,
> >>>
> >>> I had a good conversation last week with Corwin Coburn and Jered Floyd
> >>> (two of the dedup experts who joined Red Hat last year as part of the
> >>> Permabit acquisition) about distributed dedup in Ceph.
> >>>
> >>> The gist of our conversation is that the approach we're currently taking
> >>> is the only one we've looked at that makes sense, but how effective it is
> >>> really comes down to how well we can dedup based on a content-based
> >>> chunking algorithm with largish chunks.  Their product (VDO) was very
> >>> effective because they chunked at very small granularity (4 KB blocks),
> >>> which allowed them to get good dedup ratios, even on content that had
> >>> limited similarities at the block layer, but such small chunk sizes aren't
> >>> practical in a distributed pool like RADOS.  In their experience (having
> >>> tried many different architectural approaches over the last 10 years),
> >>> dedup ratios fall off sharply as the chunk sizes increase.
> >> There was a video-streaming use case a while back where potentially the same
> >> video data might be duplicated many times.  Would those still be candidates
> >> for coarse-grained dedup?
> > Yes... the content-based chunking should work well even with very
> > large chunk sizes for this use case.
>
>
> There is a tension in the world of dedup between large chunks - where every
> byte has to be the same - and small chunks (more work, but the ability to
> dedup parts of the data even without a full match).
>
> It would be interesting to see, on real-world data, how much VDO can do with
> small chunks just sitting under a single OSD...
>
> ric
>
>
>
> >
> >>> The main takeaway was that our big unknown is how well content-based
> >>> chunking is going to work on real data sets--specifically, the kinds
> >>> of data sets our users do/will have.  The other thought was that it
> >>> probably makes sense to look at RGW object workloads initially: given
> >>> the read penalty we'll see on deduped objects, users are most likely
> >>> to deploy this on colder object data sets.
> >> It fills me with a bit of dread. I guess the question I would have is whether
> >> we actually expect there to be a lot of small-object dedup benefit.  I could
> >> see it for (small) image hosting a la Facebook, I guess.  What other use cases
> >> did the Permabit guys target?
> > VDO is pretty general-purpose, targeting a single block device on a
> > single host.  This lets them slot in dedup in cases where you wouldn't
> > traditionally expect it: any regular block device that isn't super
> > performance-sensitive could be deduped, and with flash it will still
> > perform reasonably.
> >
> > For us, deduping small images would work (they'd be one chunk), but
> > there will be an additional layer of indirection to map names to
> > content, and the storage savings are probably not worth it for a small
> > data set that is performance sensitive.
> >
> > What I keep coming back to is that I expect that the bulk of the data
> > stored is going to land in RGW (video, images, etc) or maybe CephFS and
> > that the data landing in block is probably going to be (1) both a small
> > fraction of the total data footprint and (2) performance sensitive.
> >
> > sage
> >
> >
> >>> What I'd like to do soon/next is to write a tool that will enumerate rados
> >>> objects in a given pool, read each object and apply a chunking algorithm
> >>> (e.g., "rabin sliding window with ~128KB target object size"), and then
> >>> calculate some stats by building a big in-memory unordered_set (or
> >>> similar) of the hash values.  This will allow us to run this on any
> >>> cluster with an existing data set to estimate what type of dedup ratios we
> >>> can expect.  We can try it with different data sets, different chunk
> >>> sizes, and eventually more sophisticated chunking algorithms (e.g., ones
> >>> that chunk more intelligently based on recognized file formats, etc.).
> >>> We can also make the tool extrapolate its results after scanning only a
> >>> portion of the object namespace (say, only a specific range of PGs) since
> >>> our objects are uniformly distributed across the pool hash space (and
> >>> the tool needs to fit its big hash table of fingerprints in memory).
> >>>
> >>> What do you think?
> >> A tool that could look at an existing cluster and make a greedy and
> >> interruptible estimate of potential dedup ratios at different granularity
> >> levels is absolutely the right way to approach this imho.
> >>
>

The tool you mentioned is essential for using the dedup tier efficiently.
I will think about both that tool and scrub, and then make a new PR (WIP).
We can discuss in more detail then; a rough sketch of what such a scanner
might look like is below.
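To make the discussion concrete, here is a rough sketch of the kind of
scanner described above.  It is only an illustration under assumptions, not
the proposed tool: it uses the standard librados Python bindings, a
deliberately simplified content-derived chunker in place of a real Rabin
sliding window, and placeholder values for the pool name, conffile path,
and chunk-size parameters.

#!/usr/bin/env python3
#
# Illustration only: a minimal version of the dedup-ratio scanner discussed
# above.  Assumes the standard librados Python bindings ("rados" module).
# Pool name, conffile path, and chunk parameters are placeholders.

import hashlib
import rados

POOL = "mypool"                   # placeholder pool name
TARGET_CHUNK = 128 * 1024         # ~128 KB target chunk size
BOUNDARY_MASK = TARGET_CHUNK - 1  # boundary when (hash & mask) == mask
MIN_CHUNK = TARGET_CHUNK // 4     # avoid pathologically small chunks
MAX_CHUNK = TARGET_CHUNK * 4      # hard cap on chunk size
READ_SIZE = 4 * 1024 * 1024       # read objects in 4 MB pieces


def chunk_iter(data):
    """Yield content-derived chunks of data.

    NOT real Rabin fingerprinting: the hash covers the whole chunk so far
    rather than a sliding window, so boundaries will not resynchronize
    after edits as well as the real algorithm.  It only shows the shape
    of the tool."""
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & BOUNDARY_MASK) == BOUNDARY_MASK) \
                or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]


def main():
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx(POOL)

    fingerprints = set()   # big in-memory set of chunk fingerprints
    total_bytes = 0
    unique_bytes = 0

    for obj in ioctx.list_objects():
        # Read the whole object (a real tool would stream it through the
        # chunker instead of buffering it).
        data, off = b"", 0
        while True:
            piece = ioctx.read(obj.key, READ_SIZE, off)
            if not piece:
                break
            data += piece
            off += len(piece)

        for chunk in chunk_iter(data):
            total_bytes += len(chunk)
            fp = hashlib.sha256(chunk).digest()
            if fp not in fingerprints:
                fingerprints.add(fp)
                unique_bytes += len(chunk)

    ratio = total_bytes / unique_bytes if unique_bytes else 1.0
    print("scanned %d bytes, %d unique -> estimated dedup ratio %.2fx"
          % (total_bytes, unique_bytes, ratio))

    ioctx.close()
    cluster.shutdown()


if __name__ == "__main__":
    main()

As noted above, a real version would scan only a subset of PGs and
extrapolate (since objects are uniformly distributed across the pool hash
space), stream large objects through the chunker, and support different
chunk sizes and fingerprint algorithms.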

Myoungwon



