Re: [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target

Vivek Goyal <vgoyal@xxxxxxxxxx> · Fri, 30 Jan 2015 10:56:39 -0500

On Fri, Jan 23, 2015 at 11:27:39AM -0500, Vasily Tarasov wrote:

[..]
> > - Why did you implement inline deduplication as opposed to out-of-line
> >   deduplication? Section 2 (Timeliness) of the paper only mentions
> >   out-of-line dedup and does not go into detail about why you chose
> >   the in-line approach.
> >
> >   I am wondering whether it would make more sense to first implement
> >   out-of-line dedup and punt a lot of the cost to a worker thread (which
> >   kicks in only when storage is idle). That way, even if we don't get a
> >   high dedup ratio for a workload, inserting a dedup target in the stack
> >   will be less painful from a performance point of view.
> 
> Both in-line and off-line deduplication approaches have their own
> pluses and minuses. Among the minuses of the off-line approach is
> that it requires allocation of extra space to buffer non-deduplicated
> writes,

Well, that extra space requirement is temporary, and you have to pay the
cost somewhere. Personally, I would be more than happy to consume more disk
space at write time, avoid the performance hit, and let worker threads
optimize space usage later.

> re-reading the data from disk when deduplication happens (i.e.
> more I/O used).

Worker threads are supposed to kick in when the disk is idle, so the extra
I/O might not be as big a concern.

> It also complicates space usage accounting, and the user
> might run out of space even though the deduplication process would
> discover many duplicated blocks later.

Either way, the user needs to plan for extra space. Dedup is not an exact
science, and one does not know in advance what the dedup ratio of a data
set will be.

> 
> Our final goal is to support both approaches but for this code
> submission we wanted to limit the amount of new code. In-line
> deduplication is a core part, around which we can implement off-line
> dedup by adding an extra thread that will reuse the same logic as
> in-line deduplication.

Ok. I am fine with building both if that makes sense. 
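
To make the "extra thread reusing the same logic" structure concrete, here
is a rough user-space sketch in Python. It is only an illustration under
stated assumptions and is not taken from the dm-dedup patches: the class
DedupStore, its method names, the SHA-256 hash, and the 4 KiB block size
are all hypothetical. The point it shows is that the in-line write path
and the off-line worker can call one shared dedup_block() routine.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


class DedupStore:
    """Toy block store mapping logical blocks (LBN) to physical blocks (PBN)."""

    def __init__(self):
        self.physical = {}     # pbn -> block data
        self.refcount = {}     # pbn -> number of LBNs mapped to it
        self.lbn_to_pbn = {}   # lbn -> pbn
        self.hash_to_pbn = {}  # content hash -> pbn
        self.pending = []      # LBNs written without dedup, awaiting the worker
        self.next_pbn = 0

    def _alloc(self, data):
        pbn = self.next_pbn
        self.next_pbn += 1
        self.physical[pbn] = data
        self.refcount[pbn] = 0
        return pbn

    def _unref(self, pbn):
        self.refcount[pbn] -= 1
        if self.refcount[pbn] == 0:
            data = self.physical.pop(pbn)
            del self.refcount[pbn]
            digest = hashlib.sha256(data).digest()
            # Drop the hash index entry only if it points at the freed block.
            if self.hash_to_pbn.get(digest) == pbn:
                del self.hash_to_pbn[digest]

    def _map(self, lbn, pbn):
        old = self.lbn_to_pbn.get(lbn)
        self.lbn_to_pbn[lbn] = pbn
        self.refcount[pbn] += 1  # take the new reference before dropping the old
        if old is not None:
            self._unref(old)

    def dedup_block(self, lbn, data, current_pbn=None):
        """Shared core: map lbn to an existing copy of data if one is indexed."""
        digest = hashlib.sha256(data).digest()
        pbn = self.hash_to_pbn.get(digest)
        if pbn is None:
            # No duplicate known: index the block we already have, or allocate.
            pbn = current_pbn if current_pbn is not None else self._alloc(data)
            self.hash_to_pbn[digest] = pbn
        self._map(lbn, pbn)

    def write_inline(self, lbn, data):
        """In-line path: hash and dedup before the write completes."""
        self.dedup_block(lbn, data)

    def write_deferred(self, lbn, data):
        """Off-line path: store the block as-is and queue it for the worker."""
        self._map(lbn, self._alloc(data))
        self.pending.append(lbn)

    def run_worker(self):
        """Off-line pass (e.g. when idle): re-read queued blocks and dedup them."""
        while self.pending:
            lbn = self.pending.pop()
            pbn = self.lbn_to_pbn[lbn]
            self.dedup_block(lbn, self.physical[pbn], current_pbn=pbn)

    def read(self, lbn):
        return self.physical[self.lbn_to_pbn[lbn]]
```

In this toy model, write_deferred() temporarily uses one physical block per
write, and run_worker() has to re-read each queued block, which matches the
extra-space and extra-I/O costs discussed above.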

I also understand that there are pros and cons to both approaches. It is
just that, given the high cost of inline dedup, I find it a little odd
that it is being implemented first rather than the offline one.

Anyway, I will spend some time on patches now.

Thanks
Vivek
