On Fri, Feb 23, 2007 at 12:10:35AM -0800, Junio C Hamano wrote:
> mkoegler@xxxxxxxxxxxxxxxxx (Martin Koegler) writes:
>
> > Committing a new version in GIT increases the storage by the compressed
> > size of each changed blob. Packing all unpacked objects decreases the
> > required storage, but does not generate deltas against objects in
> > packs. You need to repack all objects to get around this.
> >
> > For normal source code, this is not a problem. But if you want to use
> > git for big files, you waste storage (or CPU time for everything
> > repacking).
>
> Three points that might help you without any code change.
>
>  - Have you run "git repack -a -d" without "-f"?  Reusing of
>    existing delta is specifically designed to avoid the "CPU
>    time for everything repacking" problem.
>
>  - If you are dealing with something other than "normal source
>    code", do you know if your objects delta against each other
>    well?  If not, turning core.legacyheaders off might be a
>    win.  It allows the objects that are recorded as non-delta in
>    resulting pack to be copied straight from loose objects.

I currently use CVS to save the daily changes in database dumps (files
mostly containing INSERT INTO xx (...) VALUES (...);). I'm trying to
switch this to git.

A commit typically consists of a few files, each larger than 100 MB
and growing every day. The unpacked blob objects of one commit
currently require about 60 MB. An incremental pack file containing one
commit is smaller than 1 MB, so the delta compression works well.

>  - Once you accumulated large enough packs with existing
>    objects, marking them with .keep would leave them untouched
>    during subsequent repack.  When "git repack -a -d" repacks
>    "everything", its definition of "everything" becomes "except
>    things that are in packs marked with .keep files".
>
> Side note: Is the .keep mechanism sufficiently documented?  I am
> too lazy to check that right now, but here is a tip.
> After releasing the big one, like v1.5.0, I do:

I have not found any mention of this in the git documentation.

>   $ P=.git/objects/pack
>   $ git rev-list --objects v1.5.0 |
>       git pack-objects --delta-base-offset \
>         --depth=30 --window=100 --no-reuse-delta pack
>   ...
>   6fba5cb8ed92dfef71ff47def9f95fa1e703ba59
>   $ mv pack-6fba5cb8ed92dfef71ff47def9f95fa1e703ba59.* $P/
>   $ echo 'Post 1.5.0' >$P/pack-6fba5cb8ed92dfef71ff47def9f95fa1e703ba59.keep
>   $ git gc --prune
>
> This does three things:
>
>  - It packs everything reachable from v1.5.0 with delta chain
>    that is deeper than the default.
>
>  - The pack is installed in the object store; the presence of
>    .keep file (the contents of it does not matter) tells
>    subsequent repack not to touch it.
>
>  - Then the remaining objects are packed into different pack.
>
> With this, the repository uses two packs, one is what I'll keep
> until it's time to do the big repack again, another is what's
> constantly recreated by repacking but contains only "recent"
> object.

This could be a practical solution for me. Its biggest disadvantage
is that each pack file is at least 60 MB.

A nice feature of git is that it normally does not change existing
files, which keeps incremental backups small. I want to retain this,
so I want to avoid unnecessary repacking.

As I have no tags, I can base the repacking decision only on file size:
* Daily: mark all packs >= e.g. 100 MB as keep and repack the repository.
* Weekly/Monthly/Yearly: repack the repository, including packs of the
  next size class.

My first idea was to write a script that deletes all keep files,
recreates them for packs bigger than a specified size, and then starts
git-repack.

As git-repack already calls find, this could easily be added to
git-repack itself:

--- git-repack	2007-02-17 18:06:09.000000000 +0100
+++ git-repack1	2007-02-26 22:09:12.000000000 +0100
@@ -8,11 +8,12 @@
 . git-sh-setup
 
 no_update_info= all_into_one= remove_redundant=
-local= quiet= no_reuse_delta= extra=
+local= quiet= no_reuse_delta= extra= sizearg=
 while case "$#" in 0) break ;; esac
 do
 	case "$1" in
 	-n)	no_update_info=t ;;
+	-s)	sizearg="-size -${2}k" ; shift; ;;
 	-a)	all_into_one=t ;;
 	-d)	remove_redundant=t ;;
 	-q)	quiet=-q ;;
@@ -46,7 +47,7 @@
 	;;
 ,t,)
 	if [ -d "$PACKDIR" ]; then
-		for e in `cd "$PACKDIR" && find . -type f -name '*.pack' \
+		for e in `cd "$PACKDIR" && find . -type f $sizearg -name '*.pack' \
 			| sed -e 's/^\.\///' -e 's/\.pack$//'`
 		do
 			if [ -e "$PACKDIR/$e.keep" ]; then

> > It only permits, that the base commit of a delta is located in a
> > different pack or as unpacked object.
>
> This "only" change needs to be done _very_ carefully, since
> self-containedness of pack files is one of the important
> elements of the stability of a git repository.

I understand the problems. GIT would need at least a list of external
base objects in the pack to speed up things like e.g. git-prune.

Regards,
Martin Kögler
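
P.S. The first idea mentioned above (delete all keep files, recreate
them for packs bigger than a given size, then run git-repack) might
look roughly like the sketch below. The function name
mark_large_packs and its two arguments are made up for illustration
and are not part of git. The "k" suffix for find -size assumes GNU or
BSD find.

```shell
#!/bin/sh
# Sketch only: mark_large_packs is a hypothetical helper, not a git command.

# mark_large_packs PACKDIR MIN_KB
# Remove every existing .keep file in PACKDIR, then create a .keep file
# next to each *.pack file that find reports as larger than MIN_KB KiB.
mark_large_packs () {
	packdir=$1
	min_kb=$2

	[ -d "$packdir" ] || return 1

	# Start from a clean state, as described in the mail.
	# Caution: this also drops .keep files created by hand or by other tools.
	rm -f "$packdir"/*.keep

	# The "k" size suffix is an assumption (GNU/BSD find).
	find "$packdir" -type f -name '*.pack' -size +"${min_kb}k" |
	while IFS= read -r pack
	do
		# The content of a .keep file does not matter; only its presence does.
		echo "kept: larger than ${min_kb} KiB" >"${pack%.pack}.keep"
	done
}
```

A daily run could then be something like
"mark_large_packs .git/objects/pack 102400 && git repack -a -d",
with a larger threshold for the weekly or monthly run.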