The short story: I wanted to import into git a dataset consisting of a single directory with 300,000 files. I tried using git-fast-import, but it wasn't able to handle the large directory size. This patchset optimizes the algorithms used for tree handling, and I get orders of magnitude improvements in memory and CPU consumption.

The patches are (see the commit messages for more explanation):

  1. grow tree storage more aggressively
  2. code rearranging to make patch 3 easier to read
  3. keep tree entries sorted and use binary instead of linear searches

The long story, with numbers:

Originally I just tried git-fast-import from 'next'. It built the pack file (about 65M) from the blobs after a few minutes, and then, while building the commit, consumed all system memory (about 1G) and crashed. The culprit was growing the tree allocation by only a constant amount each time it filled up, coupled with never passing allocated pool memory back to the OS. Patch 1 doubles the allocated size each time we run out of space.

With patch 1, the memory usage was much more reasonable (it ends up using about 46M). However, the process still ran for over an hour before I killed it (bear in mind that doing deltas on all of the blobs takes about 5 minutes). The culprit this time was the linear search through the tree entries to see whether each 'M' line was a new entry or an update to an existing one. Patch 3 turns this into a binary search (a toy sketch of the doubling and binary-search ideas is at the end of this mail).

To do some testing, I cut my original dataset down to 20,000 entries, a size that stock git-fast-import can feasibly handle. Here are the numbers.

For reference, just adding the blobs using stock git-fast-import, without making a commit (the memory report is the "Memory total" from gfi):

  mem: 2673 KiB
  5.86user 3.67system 0:09.57elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+405366minor)pagefaults 0swaps

Now here's stock git-fast-import making the commit (note the memory):

  mem: 101992 KiB
  37.07user 4.15system 0:41.55elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+430469minor)pagefaults 0swaps

Now here's with just patch 1 (better memory, but still slow):

  mem: 3688 KiB
  30.00user 3.73system 0:34.80elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+406064minor)pagefaults 0swaps

And with patches 1, 2, and 3:

  mem: 3688 KiB
  6.08user 3.71system 0:10.10elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+406064minor)pagefaults 0swaps

And my final 300,000-item dataset with patches 1, 2, and 3:

  mem: 46378 KiB
  414.17user 69.82system 8:11.92elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (0major+7730960minor)pagefaults 0swaps

Yes, this dataset is pathological. But I suspect the speed improvements will help even modest projects a little, and will almost certainly not hurt (the aggressive memory growth will probably waste a bit more memory).

-Peff
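
P.S. For anyone who wants the gist of patches 1 and 3 without reading the diffs, here is a minimal, self-contained sketch of the idea: a dynamically sized array of tree entries whose capacity doubles when it fills up, kept sorted by name so that each 'M' line costs a binary search instead of a linear scan. This is illustrative C only, not the actual fast-import code; the struct and function names here are made up for the example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct entry_sketch {
	char *name;
	/* the real entries also carry mode, sha1, etc. */
};

struct tree_sketch {
	struct entry_sketch *entries;
	size_t nr, alloc;
};

/* patch 1 idea: double the allocation instead of growing by a constant */
static void grow(struct tree_sketch *t)
{
	t->alloc = t->alloc ? t->alloc * 2 : 8;
	t->entries = realloc(t->entries, t->alloc * sizeof(*t->entries));
	if (!t->entries) {
		perror("realloc");
		exit(1);
	}
}

/*
 * patch 3 idea: keep the entries sorted by name and binary search for
 * the insertion point, so each lookup is O(log n) instead of O(n);
 * returns the entry for "name", inserting it if it was not present
 */
static struct entry_sketch *find_or_insert(struct tree_sketch *t,
					   const char *name)
{
	size_t lo = 0, hi = t->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, t->entries[mid].name);
		if (!cmp)
			return &t->entries[mid]; /* existing entry: an update */
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}

	/* not found: insert at position 'lo' to keep the array sorted */
	if (t->nr == t->alloc)
		grow(t);
	memmove(t->entries + lo + 1, t->entries + lo,
		(t->nr - lo) * sizeof(*t->entries));
	t->entries[lo].name = strdup(name);
	t->nr++;
	return &t->entries[lo];
}

int main(void)
{
	struct tree_sketch t = { NULL, 0, 0 };
	const char *names[] = { "zebra", "apple", "mango", "apple" };
	size_t i;

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		find_or_insert(&t, names[i]);

	for (i = 0; i < t.nr; i++)
		printf("%s\n", t.entries[i].name); /* apple, mango, zebra */
	return 0;
}

The real code obviously also tracks modes and sha1s and allocates out of pools, but the growth and lookup behavior has the same shape.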