Re: Summer of Code project ideas due this Friday

On Thu, Mar 10, 2011 at 10:40:01PM +0100, Alexander Miseler wrote:

> "While git can handle arbitrary-sized binary content [...]"
> 
> This is very much not true. Git tries in many places to load the
> complete file into memory, and usually fails with "out of memory" if
> it can't. With the 32-bit msysGit client this puts the upper file
> size limit, from purely empirical observation, at 600-700 MByte.

I think we are picking nits here. What I meant was two things:

  1. The fundamental design of git does not prevent storing
     arbitrary-sized binary data.

  2. How big a piece of data the current implementation can handle
     sanely depends on how much hardware you throw at it. On my 64-bit
     machine with 8G of RAM running Linux, I can easily work with
     2-gigabyte files. Some operations are slow, of course, but it
     works.

     I'm willing to accept that 32-bit msysgit has more trouble with
     a case like that.

But I think we are probably in agreement about what needs to be done to
make things better. Specifically, I am thinking of:

  1. Streaming blobs wherever possible (e.g., add, filters, textconv).

  2. Converting the diff code to work without in-memory files is
     probably going to be quite difficult. But most of these files
     don't have interesting diffs _anyway_. They're usually binary, and
     we don't generate binary diffs by default. So what we need to
     focus on is avoiding loading them when we can. Things like:

       a. Using the caching textconv, and when we do run the textconv
          filter, streaming the blob to the filter (a config sketch
          follows this list).

       b. Avoiding loading the whole file just to check whether it is
          binary. We can already sidestep this by marking the file
          binary in .gitattributes, but there is no reason git can't
          just read the first 4K or so to check for binary-ness and get
          this optimization automatically (see the sketch a bit further
          below). We can also consider caching binary-ness for large
          files so we don't have to look at them at all after the first
          time.

       c. Handling rename detection better. It may be a matter of
          saying "this file is too big for rename detection". But we
          may also be able to stream the file through the similarity
          hashing, and then possibly cache the resulting hash as well.

  3. The above deal with memory problems. There is also a storage
     problem. If I have a 100G repo, right now I use at least 100G in
     the .git directory and 100G in the working tree. That's a problem
     for repos of that size. If I have a storage server on the LAN and
     want to accept the latency hit, it would be nice to keep the
     commits local and the giant blobs on the server. Especially coupled
     with the optimizations in (2), we can possibly avoid even having to
     touch those blobs at all in many cases, so the latency wouldn't be
     a big deal.

  4. A related storage problem is that we put big files into packs, and
     the I/O cost of rewriting the pack becomes a problem. Such files
     would do better stored loose or in their own packs.
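
As a concrete illustration of (2a), this is roughly what a caching
textconv setup looks like with today's git; the "exif" driver name, the
*.jpg pattern, and the exiftool command are only placeholders:

    # .gitattributes
    *.jpg diff=exif

    # .git/config (or ~/.gitconfig)
    [diff "exif"]
            textconv = exiftool
            cachetextconv = true

With cachetextconv set, the converted output is cached (under a notes
ref) and reused on later diffs, so we pay the conversion cost only once
per blob. The missing piece is streaming the blob to the filter instead
of loading it into memory first.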

And I'm sure there are more variations on those things. Part of the
project would be identifying the problem areas.
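
To make (2b) concrete, here is a minimal standalone sketch of the
"peek at the first few KB" heuristic. This is only an illustration,
not git's actual implementation; the function name is made up and the
4K cutoff is just the number suggested above:

    #include <stdio.h>
    #include <string.h>

    #define FIRST_FEW_BYTES 4096

    /*
     * Guess binary-ness by reading only the start of the file and
     * looking for a NUL byte, instead of loading the whole thing.
     * Returns 1 for binary, 0 for text, -1 on error.
     */
    static int looks_binary(const char *path)
    {
        char buf[FIRST_FEW_BYTES];
        FILE *fp = fopen(path, "rb");
        size_t n;

        if (!fp)
            return -1;
        n = fread(buf, 1, sizeof(buf), fp);
        fclose(fp);
        return memchr(buf, '\0', n) != NULL;
    }

The result could then be cached per-blob (keyed by sha1), so that huge
files are not even opened on later runs.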

> Even worse yet, commits consisting of smaller files but with a
> combined size over the limit will also cause out-of-memories.

That generally should work OK. The diff and packing code tries to keep
memory usage reasonable, which generally means no more than about twice
the size of the largest file. If you have a test case that shows
problems, there may very well be a bug.

> Thus a main focus should be the memory problem, e.g. by using
> stream-like file handling everywhere, since not working at all is
> orders of magnitude worse than working slowly :)

Agreed. I think they are sort of the same problem. Whether it works
slowly or not at all is simply a matter of how much memory you have. ;)

> Ironically, git add is one of the few things that work with large
> files, as mentioned above. Presumably the stream-oriented zlib
> enforced/encouraged a stream-like handling here :) Slow as hell,
> though, and of course it is usually not sensible to compress a
> 1.5 GByte file.

I just tried "git add" on a 2G file of random bytes. It took about a
minute or so to calculate the sha1 and compress it, but the memory usage
did jump to 2G. So we could obviously do better on the memory, and there
is almost certainly no point in zlib compressing something that big. In
my case, it was obviously just random junk. But most files of that size
are already going to have some kind of lossy compression, so we are just
wasting CPU. You can always set core.compression, but I really just want
it off for certain files.
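
The memory fix here is presumably to stream the blob through the hash
(and the zlib deflate) in fixed-size chunks. A rough sketch of the
hashing half, using OpenSSL's SHA-1 purely for illustration (the
function name, chunk size, and minimal error handling are all just
placeholders, not git's code):

    #include <openssl/sha.h>
    #include <stdio.h>
    #include <sys/stat.h>

    /*
     * Compute a blob's object id without holding the whole file in
     * memory: hash the "blob <size>\0" header, then the contents in
     * fixed-size chunks, so memory use stays constant.
     */
    static int hash_blob_streaming(const char *path,
                                   unsigned char sha1[20])
    {
        char hdr[64], buf[65536];
        struct stat st;
        SHA_CTX ctx;
        FILE *fp;
        size_t n;
        int hdrlen;

        if (stat(path, &st))
            return -1;
        fp = fopen(path, "rb");
        if (!fp)
            return -1;

        /* the id covers "blob <size>\0" followed by the contents;
         * the +1 keeps the trailing NUL written by sprintf */
        hdrlen = sprintf(hdr, "blob %lu",
                         (unsigned long)st.st_size) + 1;

        SHA1_Init(&ctx);
        SHA1_Update(&ctx, hdr, hdrlen);
        while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
            SHA1_Update(&ctx, buf, n);
        fclose(fp);
        SHA1_Final(sha1, &ctx);
        return 0;
    }

The deflate side can be chunked the same way, or skipped entirely for
files we know are already compressed, which is really what the
core.compression complaint above is getting at.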

> I'm very willing to work on this topic. Though I'm not a student and
> as a git code newbie I also don't have the skills for mentoring yet.

It's on my agenda, too. We'll see if a student steps up for the GSoC
project. But don't let that stop you if you want to take a look at it;
I'm sure there is plenty of work to go around. :)

-Peff