Re: pack operation is thrashing my server

On Wed, 13 Aug 2008, Geert Bosch wrote:

> On Aug 13, 2008, at 10:35, Nicolas Pitre wrote:
> > On Tue, 12 Aug 2008, Geert Bosch wrote:
> > 
> > > I've always felt that keeping largish objects (say anything >1MB)
> > > loose makes perfect sense. These objects are accessed infrequently,
> > > often binary or otherwise poor candidates for the delta algorithm.
> > 
> > Or, as I suggested in the past, they can be grouped into a separate
> > pack, or even occupy a pack of their own.
> 
> This is fine, as long as we're not trying to create deltas
> of the large objects, or do other things that require keeping
> the inflated data in memory.

First, there is the delta attribute:

|commit a74db82e15cd8a2c53a4a83e9a36dc7bf7a4c750
|Author: Junio C Hamano <junkio@xxxxxxx>
|Date:   Sat May 19 00:39:31 2007 -0700
|
|    Teach "delta" attribute to pack-objects.
|
|    This teaches pack-objects to use .gitattributes mechanism so
|    that the user can specify certain blobs are not worth spending
|    CPU cycles to attempt deltification.
|
|    The name of the attribute is "delta", and when it is set to
|    false, like this:
|
|        == .gitattributes ==
|        *.jpg   -delta
|
|    they are always stored in the plain-compressed base object
|    representation.

This could probably be extended to take a size limit argument as well.
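
Something like this, say, where blobs larger than 1MB would never be
considered for deltification (purely hypothetical syntax -- nothing
interprets a value for the delta attribute today):

    == .gitattributes ==
    *       delta=1m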

> > As soon as you have more than
> > one revision of such largish objects then you lose again by keeping them
> > loose.
> 
> Yes, you potentially lose in terms of disk space, but you avoid the
> large memory footprint during pack generation. For very large blobs,
> it is best to degenerate to storing each revision of each file on
> its own (whether we call it a single-file pack, a loose object, or
> whatever). That way, the large file can stay immutable on disk, and
> will only need to be accessed during checkout. GIT will then scale
> with good performance until we run out of disk space.

Loose objects, though, will always be selected for potential delta 
generation.  Packed objects, deltified or not, are always streamed as-is 
when serving pull requests.  And by default, delta compression is not 
(re)attempted between objects that are part of the same pack, the 
reasoning being that if they were not deltified on the first packing 
attempt, there is no point trying again when streaming them over the 
net.  So you always benefit from having your large objects packed with 
the rest.  This, plus the delta prevention mechanism above, should cover 
most cases.
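
For example, to keep large ISO images packed but never deltified
(assuming the *.iso pattern actually matches your large files):

    == .gitattributes ==
    *.iso   -delta

    $ git repack -a -d

After that, subsequent repacks and fetches simply copy the already
compressed pack data instead of inflating it.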

> > You'll have memory usage issues whenever such objects are accessed,
> > loose or not.
> Why? The only time we'd need to access their contents is for checkout
> or when pushing across the network. These should all be streaming
> operations with a small memory footprint.

Pushing across the network, or repacking without -f, is streamed.  
Checking out currently isn't (although it probably could be).  Repacking 
with -f definitely isn't, and probably shouldn't be because of 
complexity issues.
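
Concretely, the distinction is between these two (both are standard
repack invocations):

    $ git repack -a -d      # reuses existing pack data: streamed,
                            # small memory footprint
    $ git repack -a -d -f   # recomputes deltas from scratch: loads
                            # object data into memory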

> > However, once those big objects are packed once, they can
> > be repacked (or streamed over the net) without really "accessing" them.
> > Packed object data is simply copied into a new pack in that case which
> > is less of an issue on memory usage, irrespective of the original pack
> > size.
> Agreed, but still, at least very large objects should be left alone.
> If I have a 600MB file in my repository, it should just not get in
> the way. If it gets
> copied around during each repack, that just wastes I/O time for no
> good reason. Even worse, it causes incremental backups or filesystem
> checkpoints to become way more expensive. Just leaving large files
> alone as immutable objects on disk avoids all these issues.

Pack them in a pack of their own and stick a .keep file along with it.  
At that point they will never be rewritten.
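
A minimal sketch of that (big-blobs.txt is a hypothetical file listing
the SHA-1s of your large blobs, one per line; pack-objects prints the
SHA-1 of the pack it writes):

    $ git pack-objects .git/objects/pack/pack < big-blobs.txt
    <sha1>
    $ touch .git/objects/pack/pack-<sha1>.keep
    $ git repack -a -d      # leaves packs marked with .keep alone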


Nicolas
