Re: Avery Pennarun's git-subtree?

On Fri, Jul 23, 2010 at 8:58 PM,  <skillzero@xxxxxxxxx> wrote:
> On Fri, Jul 23, 2010 at 3:50 PM, Avery Pennarun <apenwarr@xxxxxxxxx> wrote:
>> Honest question: do you care about the wasted disk space and download
>> time for these extra files?  Or just the fact that git gets slow when
>> you have them?
>
> I have a similar situation to the original poster (huge trees) and
> for me it's all three: disk space, download time, and performance. My
> tree has a few relatively small (< 20 MB) shared directories of common
> code, a few large (2-6 GB) directories of code for OS's, and then
> several medium size (< 500 MB) directories for application code. The
> application developers only care about the app+shared directories (and
> are very annoyed by the massive space and performance impact of the OS
> directories).

Given how cheap disk space is nowadays, I'm curious about this.  Are
they really just annoyed by the performance problem, and they complain
about the extra size because they blame the performance on the extra
files?  Or are they honestly short of disk space?

Similarly, are all your developers located at the same office?  If so,
then bandwidth ought not be an issue.

I'm pushing extra hard on this because I believe there are lots of
opportunities to just improve git performance on huge repositories.
And if the only *real* reason people need to split repositories is
that performance goes down, then that's fixable, and you may need
neither git-submodule nor git-subtree.

> I work on all of the pieces, but even I would
> prefer to have things separated so when I work on the apps, git
> status/etc doesn't take a big hit for close to a million files in the
> OS directories (particularly when doing git status on Windows). Even
> when using the -uno option to git status, it's still pretty slow (over
> a minute).

This is indeed a problem with large repositories.  Of course,
splitting them with git-submodule is kind of cheating, because it just
makes git-status *not look* to see if those files are dirty or not.
If they are dirty and you forget to commit them, you'll never know
until someone tells you later.  It would be functionally equivalent to
just have git-status not look inside certain subdirs of a single
repository.

In any case, this is a pretty clear optimization target (especially
since Windows is so amazingly slow at statting files): just have a
daemon running inotify (or the Windows equivalent) that tracks whether
files are up-to-date or not.  Then git would never need to recurse
through the entire tree, and operations like status, diff, checkout,
and commit could be fast even with a million-file repository.
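A toy model of that daemon idea (plain Python standing in for a real inotify/ReadDirectoryChangesW watcher; the class and paths are invented for illustration): the watcher ingests change events as they happen, so a status query only has to look at the handful of paths known to be dirty instead of statting the whole tree.

```python
# Toy model of a filesystem-watching daemon for git-status.
# In reality the events would come from inotify (Linux) or
# ReadDirectoryChangesW (Windows); here we feed them in by hand.

class DirtyTracker:
    def __init__(self, all_paths):
        # Snapshot of every tracked path, taken once at startup.
        self.tracked = set(all_paths)
        # Paths reported changed since the last status query.
        self.dirty = set()

    def on_fs_event(self, path):
        # Called by the watcher for every create/modify/delete event.
        if path in self.tracked:
            self.dirty.add(path)

    def status(self):
        # O(number of changed files), not O(size of tree):
        # no recursive stat() walk is ever needed.
        changed = sorted(self.dirty)
        self.dirty.clear()
        return changed

tracker = DirtyTracker(["os/kernel.c", "apps/main.c", "shared/code.c"])
tracker.on_fs_event("shared/code.c")   # editor saves a tracked file
tracker.on_fs_event("tmp/scratch.o")   # untracked path: ignored
print(tracker.status())                # -> ['shared/code.c']
print(tracker.status())                # -> [] (nothing new since last query)
```

The key point is that the cost of a query is proportional to the churn, not to the million files sitting untouched in the OS directories.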

> git-subtree could also possibly help, but there's still extra work to
> split and merge each repository. And I'm not sure how it handles
> commit IDs across the repositories because I want to be able to say "I
> fixed that bug in shared/code.c in commit abc123" and have both the
> OS+shared and the apps+shared people be able to git log abc123 and see
> the same change (and merge/cherry-pick/etc.).

git-subtree (if you don't use --squash) keeps all the commit IDs.  It
is extra work to split and merge between repositories, though.  It
doesn't solve your repository-is-too-large problem.
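For reference, the split workflow looks roughly like this (a self-contained sketch; all directory and branch names are invented). The commits that split synthesizes are deterministic, so re-running the split reproduces the same IDs, but as noted above the splitting and merging still has to be redone by hand:

```shell
set -eu
# Build a small demo repository (all paths invented for illustration).
work=$(mktemp -d)
cd "$work"
git init -q mono && cd mono
git config user.email demo@example.com
git config user.name demo
mkdir shared apps
echo 'int shared_fn(void);' > shared/code.c
echo 'int main(void) { return 0; }' > apps/main.c
git add . && git commit -qm 'initial import'

# Split shared/ out into its own branch; without --squash the
# synthesized history is kept commit-for-commit.
git subtree split --prefix=shared -b shared-only

# The new branch contains only the shared code, at its root.
git ls-tree --name-only shared-only    # -> code.c
```

From there the shared-only branch can be pushed to a standalone repository and later pulled back with git subtree pull.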

> I think what I want is a way to do a sparse checkout where some sort
> of module is maintained in the git repository (probably just an
> INI-style file with paths) so I can clone directly from the server and
> it figures out the objects I need for the full history of only
> apps+shared (or firmware+shared, etc.) on the server side and only
> sends those objects. I still want to be able to branch, tag, and refer
> to commit IDs. So I only take the space/download/performance hit of
> directories included in the module, but I don't have to manually
> maintain that view of the repository (as I do with git-submodule and
> git-subtree).

Yes, better sparse checkout and sparse fetch would be very valuable
here and would eliminate a lot of the reasons people have for misusing
submodules.

> (although just having all those objects in
> the .git directory still slows it down quite a bit).

You're the second person who has mentioned this today (the first one
was to me in a private email).  I'd like to understand this better.

In my bup project (http://github.com/apenwarr/bup) we regularly create
git repositories with hundreds of gigabytes of packs, comprising tens
or hundreds of millions of objects, and the repository doesn't get
slow.  (Obviously this is a separate issue from having a huge work
tree with a million files in it.)  In repositories this thoroughly
huge, we did find a way to improve memory usage versus git's pack .idx
files (bup has '.midx' files that combine multiple indexes into one,
thus reducing the binary search steps).  But this only matters when
you get well over 10 gigabytes of stuff and you're wading through it
using crappy python code (as bup does) and frequently inserting a
million objects at a time (as bup does).  The git usage pattern is
much simpler and therefore faster.
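The .midx win is easy to see in miniature: with k separate sorted .idx files, a lookup in the worst case is k binary searches, while one merged index needs a single search. A rough model in Python (this is the lookup arithmetic only, not bup's on-disk format; the hash strings are made up):

```python
import bisect

# Pretend each .idx file is a sorted list of object hashes.
idx_files = [
    ["0a", "3f", "9c"],
    ["1b", "7e"],
    ["2d", "5a", "8f", "ff"],
]

def lookup_multi(idxes, h):
    """One binary search per .idx file: up to k * O(log n) probes."""
    probes = 0
    for idx in idxes:
        probes += 1
        i = bisect.bisect_left(idx, h)
        if i < len(idx) and idx[i] == h:
            return True, probes
    return False, probes

# A .midx merges all the indexes into one sorted list...
midx = sorted(h for idx in idx_files for h in idx)

def lookup_midx(idx, h):
    """...so a lookup is a single O(log(k * n)) binary search."""
    i = bisect.bisect_left(idx, h)
    return i < len(idx) and idx[i] == h

found, probes = lookup_multi(idx_files, "8f")
print(found, probes)            # -> True 3  (had to consult all three files)
print(lookup_midx(midx, "8f"))  # -> True   (one search, one index)
```

With hundreds of packs the multi-index probe count adds up, which is why merging the indexes pays off once the repository gets really big.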

How big is your .git directory and what performance problems do you
see?  I assume you've done 'git gc' to clean up all the loose objects,
right?

Have fun,

Avery
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

