Re: pack operation is thrashing my server

On Aug 13, 2008, at 10:35, Nicolas Pitre wrote:
> On Tue, 12 Aug 2008, Geert Bosch wrote:
>
>> I've always felt that keeping largish objects (say anything >1MB)
>> loose makes perfect sense. These objects are accessed infrequently,
>> and are often binary or otherwise poor candidates for the delta
>> algorithm.

> Or, as I suggested in the past, they can be grouped into a separate
> pack, or even occupy a pack of their own.

This is fine, as long as we're not trying to create deltas of the
large objects, or do other things that require keeping the inflated
data in memory.
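
(For illustration, one way to spell out that policy is to turn off
delta attempts per path; a sketch, assuming the "delta" attribute in
gitattributes behaves as documented, with made-up patterns:

    # .gitattributes -- the patterns are only examples; adjust to taste.
    # "-delta" tells Git never to attempt delta compression on blobs
    # stored for these paths.
    *.mov   -delta
    *.iso   -delta
    *.bin   -delta

Such blobs are then stored whole, merely deflated, whether loose or
packed.)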

> As soon as you have more than one revision of such largish objects,
> you lose again by keeping them loose.

Yes, you potentially lose some disk space, but you avoid the large
memory footprint during pack generation. For very large blobs, it is
best to degenerate to having each revision of each file stored on its
own (whether we call that a single-file pack, a loose object, or
whatever). That way, the large file can stay immutable on disk and
only needs to be accessed during checkout. Git will then scale with
good performance until we run out of disk space.
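
(Later versions of Git grew a knob along these lines; a sketch,
assuming core.bigFileThreshold behaves as its documentation describes,
with the cutoff chosen only to match the >1MB figure above:

    # Treat blobs over 1 MB as "big": store them deflated and whole,
    # and never attempt delta compression on them.
    git config core.bigFileThreshold 1m
)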

The alternative is that people need to keep large binary data out of
their SCMs and handle it on the side. Consider a large web site where
I have scripts and HTML content as well as a few movies to manage.
The movies should basically just be copied and stored, and only be
accessed when a checkout (or push) is requested.

If we mix the very large movies with the 100,000 objects representing
the web pages, the resulting pack becomes unwieldy and slow even to
just copy around during repacks.
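
(To make the "separate pack" idea concrete: one could pack just the
movies on their own. A rough sketch, assuming the movies live under a
top-level media/ directory -- the path and pack name are made up for
illustration:

    # List every object reachable from HEAD, keep only the blobs whose
    # path is under media/, and write them into a pack of their own
    # (pack-objects reads "<sha1> <path>" lines on stdin).
    git rev-list --objects HEAD \
        | grep ' media/' \
        | git pack-objects .git/objects/pack/pack-media

pack-objects prints the SHA-1 of the pack it wrote, so the result is a
pack-media-<sha1>.pack/.idx pair alongside the existing packs.)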

> You'll have memory usage issues whenever such objects are accessed,
> loose or not.
Why? The only time we'd need to access their contents is for checkout
or when pushing across the network. These should all be streaming
operations with a small memory footprint.
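
(Tangentially, since the subject line is about a thrashing server: the
memory the delta search itself may use can also be bounded. A sketch,
assuming these pack.* settings behave as documented; the numbers are
arbitrary examples:

    # Cap the per-thread delta window memory and the delta cache, and
    # limit the number of delta-search threads, so a repack cannot
    # grow without bound.
    git config pack.windowMemory 256m
    git config pack.deltaCacheSize 128m
    git config pack.threads 2

These only bound the delta search; they don't change what ends up in
the pack.)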

> However, once those big objects are packed once, they can be
> repacked (or streamed over the net) without really "accessing" them.
> Packed object data is simply copied into a new pack in that case,
> which is less of an issue for memory usage, irrespective of the
> original pack size.
Agreed, but this still matters, at least for very large objects. If I
have a 600MB file in my repository, it should just not get in the way.
If it gets copied around during each repack, that just wastes I/O time
for no good reason. Even worse, it makes incremental backups and
filesystem checkpoints far more expensive. Just leaving large files
alone as immutable objects on disk avoids all of these issues.
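
(A rough way to get that "leave it alone" behaviour with packs as they
exist today is a .keep marker, which git repack honours by not
rewriting the objects in the marked pack. A sketch, continuing the
made-up pack-media name from above:

    # A pack-<name>.keep file next to a pack tells repack to leave
    # that pack, and the objects in it, alone; <sha1> stands for the
    # hash pack-objects printed when it wrote the pack.
    touch .git/objects/pack/pack-media-<sha1>.keep

After that, "git repack -a -d" rewrites everything else but copies
nothing out of, and never deletes, the kept pack.)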

  -Geert
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
