Hi, I wrote git-annex, and pristine-tar, and etckeeper. I enjoy making
git do things that I'm told it shouldn't be used for. :) I should have
probably talked more about git-annex here, before.

Eric Montellese wrote:
> 2. zipped tarballs of source code (that I will never need to modify)
> -- I could unpack these and then use git to track the source code.
> However, I prefer to track these deliverables as the tarballs
> themselves because it makes my customer happier to see the exact
> tarball that they delivered being used when I repackage updates.
> (Let's not discuss problems with this model - I understand that this
> is non-ideal).

In this specific case, you can use pristine-tar to recreate the
original, exact tarballs from unpacked source files that you check into
git. It accomplishes this without the overhead of duplicating compressed
data in tarballs. I feel in this case, this is a better approach than
generic large file support, since it stores all the data in git, just in
a much more compressed form, and so fits in nicely with standard
git-based source code management.

> The short version:
> ***Don't track binaries in git. Track their hashes.***

That was my principle with git-annex. Although slightly generalized to:
"Don't track large file contents in git. Track unique keys that an
arbitrary backend can use to obtain the file contents."

Now, you mention in a followup that git-annex does not default to
keeping a local copy of every binary referenced by a file in master.
This is true, for the simple reason that a copy of every file in the
master branch of some of my git repos would sum to multiple terabytes
of data. :) I think that practically, anything that supports large
files in git needs to support partial checkouts too.

But, git-annex can be run in e.g. a post-merge hook, and asked to
retrieve all current file contents, and drop outdated contents.
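
To make the "track keys, not contents" principle above a bit more
concrete, here is a rough Python sketch. It is not git-annex's code:
the names annex_file and is_present, the store directory, and the
simplified "SHA256-<hex>" key format are all assumptions made up for
illustration. The content moves into a key-named object store, a
symlink is left in its place for git to track, and a dangling symlink
simply means the content is not present locally, which is what a
partial checkout looks like.

```python
import hashlib
import os
import shutil


def key_for(path):
    """Return a content-derived key (hypothetical, simplified format)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so large files are never fully in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return "SHA256-" + digest.hexdigest()


def annex_file(path, store_dir):
    """Move path's content into store_dir; leave a symlink in its place."""
    key = key_for(path)
    os.makedirs(store_dir, exist_ok=True)
    target = os.path.join(store_dir, key)
    if os.path.exists(target):
        os.remove(path)  # identical content is already stored
    else:
        shutil.move(path, target)
    # A relative link keeps working if the whole tree is moved elsewhere.
    link_dir = os.path.dirname(os.path.abspath(path))
    os.symlink(os.path.relpath(target, link_dir), path)
    return key


def is_present(path):
    """True if the content behind the symlink is available locally."""
    # os.path.exists follows symlinks, so a dangling link yields False.
    return os.path.exists(path)
```

Only the small symlink would be committed; the store itself stays out
of git's history, and something else (the "arbitrary backend") is
responsible for fetching content for a key when is_present says no.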

> First the layout:
> my_git_project/binrepo/
> -- binaries/
> -- hashes/
> -- symlink_to_hashes_file
> -- symlink_to_another_hashes_file
> within the "binrepo" (binary repository) there is a subdirectory for
> binaries, and a subdirectory for hashes. In the root of the 'binrepo'
> all of the files stored have a symlink to the current version of the
> hash.

Very similar to git-annex in the use of versioned symlinks here.
It stores the binaries in .git/annex/objects to avoid needing to
gitignore them.

> 3. In my setup, all of the binary files are in a single "binrepo"
> directory. If done from within git, we would need a non-kludgey way
> to allow large binaries to exist anywhere within the git tree.

git-annex allows the symlinks to be mixed with regular git managed
content throughout the repository. (This means that when symlinks are
moved, they may need to be fixed, which is done at commit time.)

> 5. Command to purge all binaries in your "binrepo" that are not needed
> for the current revision (if you're running out of disk space
> locally).

Safely dropping data is really one of the complexities of this approach.
Git-annex stores location tracking information in git, so it can know
where it can retrieve file data *from*. I chose to make it very cautious
about removing data, as location tracking data can fall out of date (if
for example, a remote had the data, had dropped it, and has not pushed
that information out). So it actively confirms that enough other copies
of the data currently exist before dropping it. (Of course, these checks
can be disabled.)

> 6. Automatically upload new versions of files to the "binrepo" (rather
> than needing to do this manually)

In git-annex, data transfer is done using rsync, so that interrupted
transfers of large files can be resumed. I recently added a
git-annex-shell to support locked-down access, similar to git-shell.

BTW, I have been meaning to look into using smudge filters with git-annex.
I'm a bit worried about some of the potential overhead associated with
smudge filters, and I'm not sure how a partial checkout would work with
them.

--
see shy jo