Re: Avery Pennarun's git-subtree?

On Fri, Jul 23, 2010 at 6:20 PM, Avery Pennarun <apenwarr@xxxxxxxxx> wrote:
> On Fri, Jul 23, 2010 at 8:58 PM,  <skillzero@xxxxxxxxx> wrote:
>> On Fri, Jul 23, 2010 at 3:50 PM, Avery Pennarun <apenwarr@xxxxxxxxx> wrote:
>>> Honest question: do you care about the wasted disk space and download
>>> time for these extra files?  Or just the fact that git gets slow when
>>> you have them?
>>
>> I have a situation similar to the original poster's (huge trees), and
>> for me it's all three: disk space, download time, and performance. My
>> tree has a few relatively small (< 20 MB) shared directories of common
>> code, a few large (2-6 GB) directories of OS code, and then several
>> medium-sized (< 500 MB) directories of application code. The
>> application developers only care about the app + shared directories
>> (and are very annoyed by the massive space and performance impact of
>> the OS directories).
>
> Given how cheap disk space is nowadays, I'm curious about this.  Are
> they really just annoyed by the performance problem, and they complain
> about the extra size because they blame the performance on the extra
> files?  Or are they honestly short of disk space?

I think it's both space and performance. When you're using SSDs,
storage is still pretty expensive. A 128 GB or smaller SSD is pretty
common in a laptop, so you can run out quickly, especially when you're
working on a few different branches at the same time. It's useful to
keep multiple working copies (e.g. via git-new-workdir) because
rebuild time can be significant when switching branches.
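For example (git-new-workdir ships in contrib/workdir of the git
source tree; the paths and branch name below are made up):

    # Create a second working copy that shares one object store,
    # instead of cloning the whole repository again:
    cd project
    git-new-workdir . ../project-topic topic-branch
    # ../project-topic is now a checkout of topic-branch whose .git
    # internals are symlinks back to project/.git, so the objects
    # aren't duplicated on disk.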

> Similarly, are all your developers located at the same office?  If so,
> then bandwidth ought not be an issue.

Bandwidth isn't a big problem because you don't need to re-download
the repo very often. However, people work at home a lot, where
bandwidth is more limited. The biggest complaint I hear about
bandwidth is that people tend to re-download when something goes wrong
(e.g. inexperience with git leaving them with a repository they
believe is unrecoverable after a bad reset).
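(For what it's worth, a bad reset is usually recoverable without
re-cloning; the old commits stay reachable through the reflog for a
while:)

    # After an accidental "git reset --hard":
    git reflog                 # find the commit you were on, e.g. HEAD@{1}
    git reset --hard HEAD@{1}  # move the branch back to it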

> I'm pushing extra hard on this because I believe there are lots of
> opportunities to just improve git performance on huge repositories.
> And if the only *real* reason people need to split repositories is
> that performance goes down, then that's fixable, and you may need
> neither git-submodule nor git-subtree.

Performance degradation is my biggest complaint with large
repositories. Your idea of an inotify/FSEvents/etc. daemon to deal
with the stat issue sounds interesting.
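As a very rough illustration of the idea (this uses inotifywait from
inotify-tools and just logs paths; a real daemon would instead feed
the changed paths to git so status could skip lstat()ing everything
else):

    # Watch the whole tree and record which paths change:
    inotifywait -m -r -e modify,create,delete,move . \
        >> /tmp/changed-paths.log &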

> This is indeed a problem with large repositories.  Of course,
> splitting them with git-submodule is kind of cheating, because it just
> makes git-status *not look* to see if those files are dirty or not.
> If they are dirty and you forget to commit them, you'll never know
> until someone tells you later.  It would be functionally equivalent to
> just have git-status not look inside certain subdirs of a single
> repository.

I think it's only cheating if you're using all of the submodules. The
main purpose of submodules for me (although I don't currently use
them) would be to avoid keeping modules on disk that I don't care
about. If a developer is working on an app, they don't need the OS
directories/modules, so git status etc. would be much faster and there
would be no other directories to leave dirty files in. That said, if I
were using git submodule, I'd want git status to show me all the
submodules that were checked out.
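(In the meantime, git submodule status at least lists them, though as
a separate command rather than as part of git status:)

    # One line per submodule: a leading "-" means it isn't checked
    # out / initialized, "+" means its checkout doesn't match the
    # commit recorded in the superproject.
    git submodule status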

>> (although just having all those objects in
>> the .git directory still slows it down quite a bit).
>
> You're the second person who has mentioned this today (the first one
> was to me in a private email).  I'd like to understand this better.

What I'm basing this on is that even when I'm using a sparse checkout,
such that I have only a small subset of the files in my working
directory, git status seems significantly slower for me than in an
equivalent git repository that only contains that subset of files.
That's not very scientific, but it's what made me think that just
having a large .git directory with lots of objects/history slows down
git status even if the working copy doesn't have a lot of files.
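For reference, this is roughly how the sparse checkout is set up
(git 1.7.0 or later; "apps/" is just an example pattern):

    # Limit the working copy to a subset of the tree:
    git config core.sparseCheckout true
    echo "apps/" > .git/info/sparse-checkout
    git read-tree -mu HEAD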

I will try to experiment and see if I can narrow it down with some real numbers.
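Probably something along these lines, run in both the full repository
and the subset-only one:

    # Compare object counts and status time between the two repos:
    git count-objects -v
    time git status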

BTW, what's the policy on CC'ing people in git mailing list replies?
Should the CC list be trimmed or not? I've received complaints in the
past, but I was never really clear on what the recommended policy is.