On Tue, Oct 9, 2018 at 12:19 AM Ric Wheeler <rwheeler@xxxxxxxxxx> wrote:
>
> On 10/8/18 9:28 AM, Sage Weil wrote:
> > On Mon, 8 Oct 2018, Mark Nelson wrote:
> >> On 10/08/2018 07:44 AM, Sage Weil wrote:
> >>> Hi Myoungwon,
> >>>
> >>> I had a good conversation last week with Corwin Coburn and Jered Floyd
> >>> (two of the dedup experts who joined Red Hat last year as part of the
> >>> Permabit acquisition) about distributed dedup in Ceph.
> >>>
> >>> The gist of our conversation is that the approach we're currently
> >>> taking is the only one we've looked at that makes sense, but how
> >>> effective it is really comes down to how well we can dedup based on a
> >>> content-based chunking algorithm with largish chunks.  Their product
> >>> (VDO) was very effective because they chunked at very small
> >>> granularity (4 KB blocks), which allowed them to get good dedup
> >>> ratios even on content that had limited similarities at the block
> >>> layer, but such small chunk sizes aren't practical in a distributed
> >>> pool like RADOS.  In their experience (having tried many different
> >>> architectural approaches over the last 10 years), dedup ratios fall
> >>> off sharply as the chunk sizes increase.
> >> There was a video-streaming use case a while back where potentially
> >> the same video data might be duplicated many times.  Would those still
> >> be candidates for coarse-grained dedup?
> > Yes... the content-based chunking should work well even with very
> > large chunk sizes for this use case.
>
> There is a tension in the world of dedup between large chunks - where
> every byte has to be the same - and small chunks (more work, but the
> ability to dedup part of the data even without a full match).
>
> It would be interesting to see, for real-world data, how much VDO can do
> with small chunks just sitting under 1 OSD...
>
> ric
>
> >>> The main takeaway was that our big unknown is how well content-based
> >>> chunking is going to work on real data sets--specifically, the kinds
> >>> of data sets our users do/will have.  The other thought was that it
> >>> probably makes sense to look at RGW object workloads initially: given
> >>> the read penalty we'll see on deduped objects, users are most likely
> >>> to deploy this on colder object data sets.
> >> It fills me with a bit of dread.  I guess the question I would have is
> >> whether we actually expect there to be a lot of small-object dedup
> >> benefit.  I could see it for (small) image hosting a la Facebook, I
> >> guess.  What other use cases did the Permabit guys target?
> > VDO is pretty general-purpose, targeting a single block device on a
> > single host.  This lets them slot in dedup in cases where you wouldn't
> > traditionally expect it: any regular block device that isn't super
> > performance-sensitive could be deduped, and with flash it will still
> > perform reasonably.
> >
> > For us, deduping small images would work (they'd be one chunk), but
> > there will be an additional layer of indirection to map names to
> > content, and the storage savings are probably not worth it for a small
> > data set that is performance sensitive.
> >
> > What I keep coming back to is that I expect the bulk of the data stored
> > is going to land in RGW (video, images, etc.) or maybe CephFS, and that
> > the data landing in block is probably going to be both (1) a small
> > fraction of the total data footprint and (2) performance sensitive.
> >
> > sage
> >
> >
> >>> What I'd like to do soon/next is to write a tool that will enumerate
> >>> rados objects in a given pool, read each object and apply a chunking
> >>> algorithm (e.g., "Rabin sliding window with ~128KB target object
> >>> size"), and then calculate some stats by building a big in-memory
> >>> unordered_set (or similar) of the hash values.  That will let us run
> >>> it on any cluster with an existing data set to estimate what type of
> >>> dedup ratios we can expect.  We can try it with different data sets,
> >>> different chunk sizes, and eventually more sophisticated chunking
> >>> algorithms (e.g., ones that chunk more intelligently based on
> >>> recognized file formats, etc.).  We can also make the tool
> >>> extrapolate its results after scanning only a portion of the object
> >>> namespace (say, only a specific range of PGs), since our objects are
> >>> uniformly distributed across the pool hash space (and the tool needs
> >>> to fit its big hash table of fingerprints in memory).
> >>>
> >>> What do you think?
> >> A tool that could look at an existing cluster and make a greedy and
> >> interruptible estimate of potential dedup ratios at different
> >> granularity levels is absolutely the right way to approach this imho.
> >>
>

The tool you mentioned is essential for using the dedup tier efficiently.
I will think about both that tool and scrub, and then make a new PR (WIP).
We can discuss the details further at that time.

Myoungwon
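
To make the proposed scanner concrete, here is a minimal sketch of what it
might look like, assuming the python-rados bindings.  The Gear-style rolling
hash below is only a stand-in for a real Rabin sliding window, SHA-256 serves
as the chunk fingerprint, and the function and constant names (chunks,
estimate, TARGET_BITS, etc.) are illustrative rather than anything that
exists in Ceph today.

#!/usr/bin/env python3
# Hypothetical sketch, not the actual tool: scan a pool with the python-rados
# bindings, chunk each object with a simple content-defined chunker (a
# Gear-style rolling hash standing in for a real Rabin sliding window), and
# count unique SHA-256 fingerprints to estimate the achievable dedup ratio.

import hashlib
import random
import sys

import rados

TARGET_BITS = 17                    # cut when low 17 bits are zero -> ~128KB avg
MAX_CHUNK = 4 * (1 << TARGET_BITS)  # hard cap so pathological data still cuts
READ_SIZE = 4 * 1024 * 1024         # read objects in 4MB pieces

random.seed(0)
GEAR = [random.getrandbits(32) for _ in range(256)]  # byte -> random hash table


def chunks(data):
    """Yield content-defined chunks of data (average ~2**TARGET_BITS bytes)."""
    mask = (1 << TARGET_BITS) - 1
    h = start = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xffffffff
        if (h & mask) == 0 or (i + 1 - start) >= MAX_CHUNK:
            yield data[start:i + 1]
            h, start = 0, i + 1
    if start < len(data):
        yield data[start:]


def read_object(ioctx, oid):
    """Read the whole object in READ_SIZE pieces."""
    buf, off = b'', 0
    while True:
        piece = ioctx.read(oid, READ_SIZE, off)
        if not piece:
            return buf
        buf += piece
        off += len(piece)


def estimate(pool):
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx(pool)
    fingerprints = set()            # the big in-memory "unordered_set"
    total = unique = 0
    try:
        # To extrapolate instead of scanning everything, this loop could stop
        # after a fixed fraction of the objects; names hash uniformly across
        # the pool, so a partial scan should be representative.
        for obj in ioctx.list_objects():
            for chunk in chunks(read_object(ioctx, obj.key)):
                fp = hashlib.sha256(chunk).digest()
                total += len(chunk)
                if fp not in fingerprints:
                    fingerprints.add(fp)
                    unique += len(chunk)
    finally:
        ioctx.close()
        cluster.shutdown()
    print("scanned %d bytes, %d unique, estimated dedup ratio %.2f"
          % (total, unique, total / max(unique, 1)))


if __name__ == '__main__':
    estimate(sys.argv[1])

Running something like this against an RGW data pool with different values
of TARGET_BITS would give a first look at how the dedup ratio falls off as
the chunk size grows, which is the open question raised above.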