Re: [PATCH] Keep last used delta base in the delta window

On Mon, Aug 27, 2007 at 12:07:58PM +0200, David Kastrup wrote:
> Well, given the little amount of spare time I have for personal
> projects, I should not go flaunting them too much, but anyway...
> 
> I am just drafting implementing a delta-diff size estimator: for every
> hash value it does not store hash chains or pointers to memory images,
> but just a bit map of at most 32 (possibly 64 where ulong delivers it,
> possibly even just 16 bits when one restricts oneself to 16 element
> windows in order to save memory) delta window files, telling which of
> the files show such a hash value anywhere. 

How do you want to select these few hash values? 

For example, I dump database tables as SQL statements into per-table files and
commit all changes. All lines in each file share a common prefix
(INSERT INTO tablename (COL1, COL2, ...) VALUES). Statistics showed that up
to 50% of the hash elements of the delta index were dropped because
they had equal hash values.

With your approach:

If rarely used values are selected as the 32 hash values, you will
probably regard files that contain different tables (and therefore a
different prefix) as "similar", although the files otherwise have very
little in common.

If you select the hash values of the prefixes, you will create one
cluster per table (if the table count is < 32/64). This clustering already
happens by using the hash value of the path name in the sorting.

I don't believe that your idea will improve things very much in this case.
I fear that it will be even worse than the current algorithm.

> When a file is considered
> for deltaing, it is run once against this delta-diff estimator which
> does not even look at the files again.  This delivers a lower bound
> for the size a delta towards each of the delta candidates can take.
> The candidates are then sorted in increasing size of expected delta
> size and are considered in turn.  Once a delta has a lower size than
> the lowest bound for further delta candidates, no further candidates
> need to get considered.

So this means that we have to hash the target entry too (which could
be merged with creating the delta index). We currently only hash an
entry if it is really needed as the source of a delta-diff.

On bigger blobs, this could slow the process down.
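
Just to make sure I read you correctly, the candidate loop I picture from
your description is roughly the following. try_delta() stands in for the
real delta attempt and every name here is made up for the example:

#include <stdlib.h>

struct candidate {
	int slot;			/* index into the delta window */
	unsigned long lower_bound;	/* estimated minimum delta size */
};

/* hypothetical: attempt a real delta, return 0 if none below max_size */
extern unsigned long try_delta(int slot, unsigned long max_size);

static int by_lower_bound(const void *a, const void *b)
{
	const struct candidate *ca = a, *cb = b;
	if (ca->lower_bound != cb->lower_bound)
		return ca->lower_bound < cb->lower_bound ? -1 : 1;
	return 0;
}

static unsigned long pick_best(struct candidate *cand, int n)
{
	unsigned long best_size = (unsigned long)-1;
	int i;

	qsort(cand, n, sizeof(*cand), by_lower_bound);
	for (i = 0; i < n; i++) {
		unsigned long size;
		/* no remaining candidate can beat what we already have */
		if (cand[i].lower_bound >= best_size)
			break;
		size = try_delta(cand[i].slot, best_size);
		if (size && size < best_size)
			best_size = size;
	}
	return best_size;
}

The sort and the early break are cheap; the open question to me is still
the extra hashing of the target.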

> Also the calculation of a delta can get aborted once it exceeds the
> size of a previous delta. 

This already happens.

> Somewhat more complex and error-prone would
> be to abort based on continually correcting the estimated lower bound
> from the estimator while one is creating a delta and dropping out as
> soon as it becomes impossible to beat a previous delta.
> 
> The nice thing about this is that one can start out with a simplistic
> estimator as long as it is guaranteed never to deliver an estimate
> that is too high.  As long as that is the case, the packs will never
> get worse than they are now: a bad estimator will just mean that more
> elements of the window need to be considered before the conclusion is
> reached.
> 
> Since the estimator data structure does not point into any
> memory-mapped files or hash chains scattered through memory, it should
> be comparatively light (in comparison to an actual delta) on page
> thrashing which appears to me one of the main performance problems,
> while delivering a reasonably useful lower bound for all deltas in the
> window at once.

Do you want to do your estimations only against a small window of objects
or all objects?
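
For what it's worth, this is how I picture the estimator for a window of
at most 32 objects: a plain bitmap table, filled once per window member,
and then one pass over the target's block hashes gives a (crude) lower
bound for all window slots at once. This is only my sketch of your idea;
none of these names exist in git, and the 16-bit hash is just an example:

#include <stdint.h>

#define EST_HASH_BITS 16
#define EST_HASH_SIZE (1u << EST_HASH_BITS)

struct delta_estimator {
	/*
	 * One bit per window slot (at most 32): set if that window
	 * member contains a block with this hash value anywhere.
	 */
	uint32_t present[EST_HASH_SIZE];
};

/* record that window slot 'slot' contains a block hashing to 'h' */
static void est_add(struct delta_estimator *est, unsigned int h, int slot)
{
	est->present[h & (EST_HASH_SIZE - 1)] |= 1u << slot;
}

/*
 * One pass over the target's block hashes: for every window slot,
 * count the target blocks whose hash is absent from that member.
 * How such a count translates into a guaranteed minimum delta size
 * is exactly the part that must never over-estimate.
 */
static void est_lower_bounds(const struct delta_estimator *est,
			     const unsigned int *target_hashes, int nr_hashes,
			     int window_size, unsigned long *missing)
{
	int i, slot;

	for (slot = 0; slot < window_size; slot++)
		missing[slot] = 0;

	for (i = 0; i < nr_hashes; i++) {
		uint32_t mask = est->present[target_hashes[i] & (EST_HASH_SIZE - 1)];
		for (slot = 0; slot < window_size; slot++)
			if (!(mask & (1u << slot)))
				missing[slot]++;
	}
}

With 16-bit hash values the table is 256 KB for the whole window, which
indeed stays away from the memory-mapped files; the open question remains
how the "missing" counts become a size bound that is never too high.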

Regards, Martin Kögler