Re: Fwd: Git and Large Binaries: A Proposed Solution

Hi, I wrote git-annex, and pristine-tar, and etckeeper. I enjoy making
git do things that I'm told it shouldn't be used for. :) I probably
should have talked more about git-annex here before now.

Eric Montellese wrote:
> 2. zipped tarballs of source code (that I will never need to modify)
> -- I could unpack these and then use git to track the source code.
> However, I prefer to track these deliverables as the tarballs
> themselves because it makes my customer happier to see the exact
> tarball that they delivered being used when I repackage updates.
> (Let's not discuss problems with this model - I understand that this
> is non-ideal).

In this specific case, you can use pristine-tar to recreate the
original, exact tarballs from unpacked source files that you check into
git. It accomplishes this without the overhead of duplicating compressed
data in tarballs. I feel in this case, this is a better approach than
generic large file support, since it stores all the data in git, just in a
much more compressed form, and so fits in nicely with standard git-based
source code management.

> The short version:
> ***Don't track binaries in git.  Track their hashes.***

That was my principle with git-annex. Although slightly generalized to:
"Don't track large file contents in git. Track unique keys that
an arbitrary backend can use to obtain the file contents."
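
To make the principle concrete, here is a rough Python sketch of replacing
a large file with a symlink named after a key derived from its content.
This is an illustration only, not git-annex's actual code: the
"SHA256-<hexdigest>" key format, the flat store directory, and the function
names content_key and annex_file are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def content_key(path, chunk_size=1 << 20):
    """Return a hypothetical key, SHA256-<hexdigest>, for the file's content."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that very large files are never held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "SHA256-" + digest.hexdigest()


def annex_file(path, store_dir):
    """Move the file into store_dir under its key; leave a symlink behind."""
    key = content_key(path)
    os.makedirs(store_dir, exist_ok=True)
    stored = os.path.join(store_dir, key)
    if os.path.exists(stored):
        os.remove(path)  # identical content is already in the store
    else:
        os.replace(path, stored)
    # Only this small relative symlink would be committed to git.
    link_dir = os.path.dirname(os.path.abspath(path))
    os.symlink(os.path.relpath(stored, link_dir), path)
    return key
```

Git then only ever sees the symlink, whose target names the key; how the
content behind a key is fetched is left entirely to the backend.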

Now, you mention in a followup that git-annex does not default to keeping
a local copy of every binary referenced by a file in master.
This is true, for the simple reason that a copy of every file in the
master branch of some of my git repos would sum to multiple terabytes of
data. :) I think that in practice, anything that supports large files in
git needs to support partial checkouts too.

But git-annex can be run in, e.g., a post-merge hook, and asked to
retrieve all current file contents and drop outdated contents.
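
A hook along those lines might look like the Python sketch below. The
subcommands (get, unused, dropunused) are taken from present-day git-annex
and may differ in your version, so check its man page first. Note also that
"unused" there means content no branch or tag refers to, which is only an
approximation of "outdated". The names hook_commands and run_hook are made
up for this example.

```python
import subprocess


def hook_commands():
    """Commands a post-merge hook could run, in order (assumed, see above)."""
    return [
        ["git", "annex", "get", "."],            # fetch content for the work tree
        ["git", "annex", "unused"],              # list content nothing refers to
        ["git", "annex", "dropunused", "all"],   # drop what 'unused' listed
    ]


def run_hook(runner=subprocess.check_call):
    """Run each command, stopping at the first failure.

    The runner argument exists so the sequence can be inspected or
    dry-run without touching a real repository.
    """
    for cmd in hook_commands():
        runner(cmd)
```

Saved as .git/hooks/post-merge, made executable, and ending with a call to
run_hook(), this would run after every successful merge or pull. Passing
print as the runner gives a dry run.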

> First the layout:
> my_git_project/binrepo/
> -- binaries/
> -- hashes/
> -- symlink_to_hashes_file
> -- symlink_to_another_hashes_file
> within the "binrepo" (binary repository) there is a subdirectory for
> binaries, and a subdirectory for hashes.  In the root of the 'binrepo'
> all of the files stored have a symlink to the current version of the
> hash.

Very similar to git-annex in the use of versioned symlinks here.
It stores the binaries in .git/annex/objects to avoid needing to
gitignore them.

> 3. In my setup, all of the binary files are in a single "binrepo"
> directory.  If done from within git, we would need a non-kludgey way
> to allow large binaries to exist anywhere within the git tree.

git-annex allows the symlinks to be mixed with regular git managed
content throughout the repository. (This means that when symlinks
are moved, they may need to be fixed, which is done at commit time.)

> 5. Command to purge all binaries in your "binrepo" that are not needed
> for the current revision (if you're running out of disk space
> locally).

Safely dropping data is really one of the complexities of this
approach. Git-annex stores location tracking information in git,
so it can know where it can retrieve file data *from*. I chose to make
it very cautious about removing data, as location tracking data can
fall out of date (if, for example, a remote had the data, has since
dropped it, and has not pushed that information out). So it actively
confirms that
enough other copies of the data currently exist before dropping it.
(Of course, these checks can be disabled.)
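
The check described above can be sketched in Python as follows. It shows
the idea only and is not git-annex's implementation: safe_to_drop, the
still_has callback, and the required_copies parameter are names invented
for the example.

```python
def safe_to_drop(key, believed_locations, still_has, required_copies=1):
    """Decide whether the local copy of key may be removed.

    believed_locations: remotes that the (possibly stale) location
        tracking data says hold the content.
    still_has: callable (remote, key) -> bool that asks the remote
        directly, rather than trusting the tracking data.
    """
    confirmed = 0
    for remote in believed_locations:
        if still_has(remote, key):
            confirmed += 1
            if confirmed >= required_copies:
                return True  # enough live copies were verified
    return False  # too few confirmations; keep the local copy
```

The point is that the tracking data only decides whom to ask. A remote
that is listed but no longer has the content does not count.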

> 6. Automatically upload new versions of files to the "binrepo" (rather
> than needing to do this manually)

In git-annex, data transfer is done using rsync, so that interrupted
transfers of large files can be resumed. I recently added a git-annex-shell
to support locked-down access, similar to git-shell.


BTW, I have been meaning to look into using smudge filters with git-annex.
I'm a bit worried about some of the potential overhead associated with
smudge filters, and I'm not sure how a partial checkout would work with
them.

-- 
see shy jo
