Re: RFC: [PATCH] Support incremental pack files

On Fri, Feb 23, 2007 at 12:10:35AM -0800, Junio C Hamano wrote:
> mkoegler@xxxxxxxxxxxxxxxxx (Martin Koegler) writes:
> 
> > Committing a new version in GIT increases the storage by the compressed
> > size of each changed blob. Packing all unpacked objects decreases the
> > required storage, but does not generate deltas against objects in
> > packs. You need to repack all objects to get around this.
> >
> > For normal source code, this is not a problem.  But if you want to use
> > git for big files, you waste storage (or CPU time for everything
> > repacking).
> 
> Three points that might help you without any code change.
> 
>  - Have you run "git repack -a -d" without "-f"?  Reusing of
>    existing delta is specifically designed to avoid the "CPU
>    time for everything repacking" problem.
> 
>  - If you are dealing with something other than "normal source
>    code", do you know if your objects delta against each other
>    well?  If not, turning core.legacyheaders off might be a
>    win.  It allows the objects that are recorded as non-delta in
>    resulting pack to be copied straight from loose objects.

I currently use CVS to save the daily changes in database dumps (files
mostly containing INSERT INTO xx (...) VALUES (...);). I'm trying to
switch this to git.

A commit typically consists of a few files, each larger than 100 MB
and growing every day. The unpacked blob objects of one commit
currently require about 60 MB. An incremental pack file containing one
commit is smaller than 1 MB, so the delta compression works well.

>  - Once you accumulated large enough packs with existing
>    objects, marking them with .keep would leave them untouched
>    during subsequent repack.  When "git repack -a -d" repacks
>    "everything", its definition of "everything" becomes "except
>    things that are in packs marked with .keep files".
> 
> Side note: Is the .keep mechanism sufficiently documented?  I am
> too lazy to check that right now, but here is a tip.  After
> releasing the big one, line v1.5.0, I do:

I have not found any mention of this in the git documentation.

>   $ P=.git/objects/pack
>   $ git rev-list --objects v1.5.0 |
>     git pack-objects --delta-base-offset \
>           --depth=30 --window=100 --no-reuse-delta pack
>   ...
>   6fba5cb8ed92dfef71ff47def9f95fa1e703ba59
>   $ mv pack-6fba5cb8ed92dfef71ff47def9f95fa1e703ba59.* $P/
>   $ echo 'Post 1.5.0' >$P/pack-6fba5cb8ed92dfef71ff47def9f95fa1e703ba59.keep
>   $ git gc --prune
> 
> This does three things:
> 
>  - It packs everything reachable from v1.5.0 with delta chain
>    that is deeper than the default.
> 
>  - The pack is installed in the object store; the presence of
>    .keep file (the contents of it does not matter) tells
>    subsequent repack not to touch it.
> 
>  - Then the remaining objects are packed into different pack.
> 
> With this, the repository uses two packs, one is what I'll keep
> until it's time to do the big repack again, another is what's
> constantly recreated by repacking but contains only "recent"
> object.

This could be a practical solution for me. The biggest disadvantage
of this solution is that each pack file is at least 60 MB.

A nice feature of git is that it normally does not change existing
files, which keeps incremental backups small. I want to retain this,
so I want to avoid unnecessary repacking.

As I have no tags, I can base the repacking decision only on file
size:

  * Daily: Mark all packs of at least e.g. 100 MB with a .keep file
           and repack the repository.
  * Weekly/Monthly/Yearly: Repack the repository, including packs of
           the next size class.

My first idea was to write a script which deletes all .keep files,
recreates them for packs bigger than a specified size, and then starts
git-repack.
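
A minimal sketch of such a wrapper, under some assumptions: the
function name mark_big_packs, its arguments, and the text written into
the .keep file are made up for illustration, and the size is computed
with wc -c rather than find. Note that it removes every .keep file in
the directory, including any that were created for other reasons.

```shell
#!/bin/sh
# Hypothetical helper: mark_big_packs PACKDIR MIN_KB
# Removes every existing .keep file, then recreates one for each pack
# whose size is at least MIN_KB kibibytes.
mark_big_packs() {
    packdir=$1
    min_kb=$2

    # Drop all existing keep markers so the size classes are re-evaluated.
    rm -f "$packdir"/pack-*.keep

    for pack in "$packdir"/pack-*.pack; do
        [ -f "$pack" ] || continue      # the glob matched nothing
        # wc -c prints the file size in bytes; convert to whole KiB.
        size_kb=$(( $(wc -c <"$pack") / 1024 ))
        if [ "$size_kb" -ge "$min_kb" ]; then
            # Only the presence of the .keep file matters; its content is free-form.
            echo "kept: at least ${min_kb} KB" >"${pack%.pack}.keep"
        fi
    done
}

# Example daily run (left commented out): keep packs of 100 MB or more,
# then repack everything else.
# mark_big_packs .git/objects/pack 102400 && git repack -a -d
```

For the weekly or monthly run, the same call with a larger MIN_KB would
leave the next size class unmarked, so that it gets repacked as well.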

As git-repack already calls find, this could easily be added to the
script:

--- git-repack  2007-02-17 18:06:09.000000000 +0100
+++ git-repack1 2007-02-26 22:09:12.000000000 +0100
@@ -8,11 +8,12 @@
 . git-sh-setup

 no_update_info= all_into_one= remove_redundant=
-local= quiet= no_reuse_delta= extra=
+local= quiet= no_reuse_delta= extra= sizearg=
 while case "$#" in 0) break ;; esac
 do
        case "$1" in
        -n)     no_update_info=t ;;
+       -s)     sizearg="-size -${2}k" ; shift; ;;
        -a)     all_into_one=t ;;
        -d)     remove_redundant=t ;;
        -q)     quiet=-q ;;
@@ -46,7 +47,7 @@
        ;;
 ,t,)
        if [ -d "$PACKDIR" ]; then
-               for e in `cd "$PACKDIR" && find . -type f -name '*.pack' \
+               for e in `cd "$PACKDIR" && find . -type f $sizearg -name '*.pack' \
                        | sed -e 's/^\.\///' -e 's/\.pack$//'`
                do
                        if [ -e "$PACKDIR/$e.keep" ]; then
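
To illustrate which packs the new -s option would select, here is a
stand-alone sketch of the patched find invocation. The function name
list_small_packs is hypothetical, and it assumes a find that accepts
the "k" size suffix (GNU and BSD find do; POSIX only specifies
512-byte blocks and "c").

```shell
#!/bin/sh
# Hypothetical stand-alone version of the patched loop: print the base
# names of the packs that "-s LIMIT_KB" would still consider for
# repacking, i.e. packs smaller than LIMIT_KB kibibytes.
list_small_packs() {
    packdir=$1
    limit_kb=$2
    # Same find/sed pipeline as in git-repack, with the size test added.
    ( cd "$packdir" && find . -type f -size "-${limit_kb}k" -name '*.pack' ) |
        sed -e 's/^\.\///' -e 's/\.pack$//'
}

# Example (left commented out): packs below 100 MB in the current repository.
# list_small_packs .git/objects/pack 102400
```

With GNU find, sizes are rounded up to whole kilobytes before the
comparison, so the limit acts as an exclusive upper bound: a pack of
exactly the limit size is skipped and stays untouched.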


> > It only permits, that the base commit of a delta is located in a
> > different pack or as unpacked object.
> 
> This "only" change needs to be done _very_ carefully, since
> self-containedness of pack files is one of the important
> elements of the stability of a git repository.

I understand the problems. At a minimum, git would need a list of
external base objects in each pack to speed up operations such as
git-prune.

Regards, Martin Kögler