On Thu, Mar 10, 2011 at 10:40:01PM +0100, Alexander Miseler wrote:

> "While git can handle arbitrary-sized binary content [...]"
>
> This is very much not true. Git tries at many places to load the
> complete file into memory and usually fails with "out of memory" if it
> can't. With the 32bit msysGit client this places the upper file size
> limit, from purely empirical observation, at 600-700 MByte.

I think we are picking nits here. What I meant was two things:

  1. The fundamental design of git does not prevent storing
     arbitrary-sized binary data.

  2. How big a piece of data the current implementation can handle
     sanely depends on how much hardware you throw at it.

On my 64-bit machine with 8G of RAM running Linux, I can easily work
with 2 gigabyte files. Some operations are slow, of course, but it
works. I'm willing to accept that 32-bit msysgit has more trouble with
a case like that.

But I think we are probably in agreement about what needs to be done to
make things better. Specifically, I am thinking of:

  1. Streaming blobs wherever possible (e.g., add, filters, textconv).

  2. Converting the diff code not to use in-memory files is probably
     going to be quite difficult. But most of these files don't have
     interesting diffs _anyway_. They're usually binary, and we don't
     generate binary diffs by default. So what we need to focus on is
     avoiding loading them when we can. Things like:

       a. Using caching textconv, and when we do run the textconv
          filter, streaming the blob to the filter.

       b. Avoiding loading the whole file to check whether it is
          binary. We can already avoid this by marking it binary with
          gitattributes, but there is no reason git can't just load the
          first 4K or so to check for binary-ness, and get this
          optimization automatically. We can also consider caching
          binary-ness for large files so we don't have to look at them
          at all after the first time.

       c. Handling rename detection better. It may be a matter of
          saying "this file is too big for detection".
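As an aside on (b): the check really only has to look at a fixed-size
prefix of the file. A minimal sketch in Python of what I mean (purely
illustrative; git's actual heuristic lives in C and the names and the
4K constant here are my own):

```python
# Hypothetical sketch of the prefix-only binary check from (b).
# The rule "a NUL byte in the prefix => binary" mirrors the usual
# heuristic; the constant is just the "4K or so" from above.

PREFIX = 4096  # only this much of the file is ever read

def looks_binary(path):
    """Guess binary-ness from the first 4K of the file."""
    with open(path, "rb") as f:
        return b"\0" in f.read(PREFIX)
```

However large the file is, that costs a single small read, and the
answer could then be cached per-blob as suggested above.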
For (c), though, we may also be able to stream the file through the
spanhash-ing, and then possibly also cache the resultant data.

  3. The above deal with memory problems. There is also a storage
     problem. If I have a 100G repo, right now I use at least 100G in
     the .git directory and 100G in the working tree. That's a problem
     for repos of that size. If I have a storage server on the LAN and
     am willing to accept the latency hit, it would be nice to keep the
     commits local and the giant blobs on the server. Especially
     coupled with the optimizations in (2), we can possibly avoid even
     having to touch those blobs at all in many cases, so the latency
     wouldn't be a big deal.

  4. A related storage problem is that we put big files in packs, and
     the I/O on rewriting the pack becomes a problem. They would do
     better loose or in their own packs.

And I'm sure there are more variations on those things. Part of the
project would be identifying the problem areas.

> Even worse yet, commits consisting of smaller files but with a
> combined size over the limit will also cause out-of-memories.

That generally should work OK. The diff and packing code tries to keep
memory usage reasonable, which generally equates to two times the
largest file. If you have a test case that shows problems, there may
very well be a bug.

> Thus a main focus should be the memory problem, e.g. by using
> stream-like file handling everywhere, since not working at all is
> orders of magnitude worse than working slowly :)

Agreed. I think they are sort of the same problem. Whether it works
slowly or not at all is simply a matter of how much memory you have. ;)

> Ironically git add is one of the few things that work with large
> files, as mentioned above. Presumably the stream-oriented zlib
> enforced/encouraged a stream-like handling here :) Slow as hell,
> though, and of course it is usually not sensible to compress a 1.5
> GByte file.

I just tried "git add" on a 2G file of random bytes.
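Alexander's compression point is easy to see even at a toy scale: zlib
gains essentially nothing on incompressible input but still burns CPU
on it. A small Python illustration (this is not what git does
internally, just zlib's defaults on two kinds of data):

```python
# Toy illustration (not git code): deflate cannot shrink random or
# already-compressed bytes, but still has to chew through all of them,
# while repetitive text shrinks to a tiny fraction of its size.
import os
import zlib

def compression_ratio(data):
    """Compressed size / original size under zlib's defaults."""
    return len(zlib.compress(data)) / len(data)

random_junk = os.urandom(1 << 20)                     # 1M of random bytes
text = b"the same line, over and over again\n" * 30000

print(compression_ratio(random_junk))  # ~1.0, i.e. no gain at all
print(compression_ratio(text))         # a small fraction of 1.0
```

The same goes for files that already carry jpeg/mp3/zip-style
compression, which is why a per-path way to turn compression off would
be nice.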
The 2G add took about a minute or so to calculate the sha1 and compress
the file, but the memory usage did jump to 2G. So we could obviously do
better on the memory, and there is almost certainly no point in
zlib-compressing something that big. In my case, it was obviously just
random junk. But most files of that size are already going to have some
kind of lossy compression, so we are just wasting CPU. You can always
set core.compression, but I really just want it off for certain files.

> I'm very willing to work on this topic. Though I'm not a student and
> as a git code newbie I also don't have the skills for mentoring yet.

It's on my agenda, too. We'll see if a student steps up for the GSoC
project. But don't let that stop you if you want to take a look at it;
I'm sure there is plenty of work to go around. :)

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html