On Wed, Sep 22, 2010 at 5:00 AM, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
> On Wed, 22 Sep 2010, Avery Pennarun wrote:
> To not load big objects into memory, we'd have to add support for the
> core.bigFileThreshold config option in more places.

Well, it could be automatic in git-pack-objects if it was enabled when
the object is larger than window_memory_limit.  But I see your point.

Anyway, it requires more knowledge of the guts of git (plus patience
and motivation) than I currently have.  My patch was sufficient to get
my previously-un-packable repo packed on my system, which was a big
step forward.

>> +     if (window_memory_limit && entry->size > window_memory_limit)
>> +             continue;
>
> I think you should even use entry->size/2 here, or even entry->size/4.
> The reason for that is 1) you need at least 2 such similar objects in
> memory to find a possible delta, and 2) reference object to delta
> against has to be block indexed and that index table is almost the same
> size as the object itself especially on 64-bit machines.

I would be happy to re-spin my exciting two-line patch to include
whatever threshold you think is best :)

Good point about #1; dividing by two does seem like a good idea.
(Though arguably you could construct a small object from a large one
just by deleting a lot of stuff, so a small + large object could fit in
a window close to the large object's size.  I doubt this happens
frequently, though.)

As for #2, does the block index table count toward the
window_memory_limit right now?  If it doesn't, then dividing by 4
doesn't seem to make sense.  If it does, then it does make sense, I
guess.  (Though maybe dividing by 3 is a bit more generous, just in
case.)

Anyway, the window_memory_limit stuff seems pretty weakly implemented
overall; with or without my patch, it seems like you could end up using
vastly more memory than the window size.  Imagine setting the
window_memory_limit to 10MB and packing a 1GB object; it still ends up
in memory at least two or three times, for a 200x overshoot.  Without
my patch, it's about 2x worse than even that, though I don't know why.

So my question is: is it worth trying to treat window_memory_limit as
"maximum memory used by the pack operation," or should it literally be
the maximum size of the window?  If the former, well, we've got a lot
of work ahead of us, but dividing by 4 might be appropriate.  If the
latter, dividing by two is probably the most we should do.

FWIW, the test timings described in my commit message are unaffected
whether we divide by 1, 2, or 4, because as it happens, I didn't have
any objects between 25 and 100 MB in size :)

Have fun,

Avery
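
P.S. In case it helps when reviewing the re-spun patch, here is a tiny
standalone sketch of the check with the divide-by-two applied.  This is
not git source: the helper name, the constants, and main() are made up
purely for illustration; in the real patch the test is just the two
quoted lines above.

  #include <stdio.h>

  /* Return 1 if an object of `size` bytes should be skipped for delta search. */
  static int too_big_to_delta(unsigned long size, unsigned long window_memory_limit)
  {
          /*
           * Divide by 2 because at least two similar-sized objects must
           * fit in the delta window at once for a delta to be found.
           */
          return window_memory_limit && size > window_memory_limit / 2;
  }

  int main(void)
  {
          unsigned long limit = 100UL * 1024 * 1024;  /* e.g. a 100 MB window memory limit */

          printf("60MB object skipped? %d\n",
                 too_big_to_delta(60UL * 1024 * 1024, limit));  /* 1: bigger than limit/2 */
          printf("40MB object skipped? %d\n",
                 too_big_to_delta(40UL * 1024 * 1024, limit));  /* 0: still a delta candidate */
          return 0;
  }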