Nicolas Pitre <nico@xxxxxxx> wrote:
> I looked at it too and the code looks OK.

Based on the stats Jon provided me it appeared as though the code
skipped about 9% of the objects, and at the time I thought I knew why,
but looking at it again right now I'm not sure why.  :-)

Assuming the objects were roughly the same size (probably a bad
assumption, but let's use it anyway) we're talking about an increase of
approximately 9%, bringing the size to around 360 MB.  Since that is
still only blobs and lacks trees & commits, and you were able to get a
pack from the earlier import attempts down to 430 MB (including trees &
commits), everything you pointed out with regard to delta base
selection not being optimal is probably the major culprit here.

> So I think that trying to make fastimport too clever is wrong.  It
> should instead focus on creating an initial pack as fast as possible and
> then rely on a final git-repack pass to produce the shrinked pack.  I
> really doubt the import code could ever make a better job than
> git-pack-objects does.

I'm mixed on that.  My goal with fast-import was to construct a pack
as cheaply as possible, in such a way that IO is relatively minimized
during pack construction and later during an initial repacking, and to
avoid all of the forking costs associated with many update-index and
repack calls.

fast-import assumes the front end driving it is being somewhat nice by
feeding revisions in newest-to-oldest order, one file at a time.  If
this assumption is false it will still produce a pack, but one that is
so large the IO to repack it will be horrible (though probably still
better than if these were all loose objects).

fast-import also assumes that pack construction speed and the memory
required to build that pack are more important than subsequent reads
of the resulting pack.  Why?  Because I fully expected the pack to
undergo a `git-repack -a -d -f` with a rather large delta depth before
it really got used by clients which cared about performance or disk
space.  :-)

Whether or not git-repack can handle such a large pack is another
story.  Apparently it can already take a ~600 MB pack and do something
useful to it in a reasonable time span, so I'm putting off worrying
about any limits that might be in there for now.

But the more I work with Jon on this Mozilla import process the more
I'm realizing that:

- fast-import.c's other name is "GIT in 1325 lines of C code, all of
  which more or less already exists elsewhere in a form that wasn't
  useful in this context without refactoring it in major ways, so I
  got lazy and rewrote it in another equally useless way";

- the strategy of slamming a very large number of objects into a pack
  may be useful in situations other than a foreign SCM import process.
  I can see someone wanting to create a large commit with a lot of
  modified objects.  Doing this with update-index and write-tree into
  loose objects would be far slower than just generating a new pack if
  the number of objects you are writing exceeds about 100 on Windows
  or ~1k on UNIX;

- Jon has lots of CPU time to spare and possibly even some memory.
  fast-import barely uses CPU and barely uses memory (it's quite
  lightweight!) but it is definitely producing a suboptimal pack as a
  result.  Just doing slightly better delta selection before
  committing an object to the output pack may prevent the need to use
  -f during git-repack (sketched below) and produce something almost
  as optimal.
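To make that final pass concrete, here is a minimal sketch of the kind
of repack I expect the fast-import output to go through.  The
--window/--depth values are only placeholders (not the non-default
settings Nico used for his 430 MB result), and if your git-repack does
not pass --window/--depth through you can give them to
git-pack-objects directly:

    # Pack everything into a single pack (-a), drop the redundant old
    # pack (-d), and recompute deltas from scratch (-f) instead of
    # reusing the cheap deltas fast-import wrote.
    git-repack -a -d -f --window=50 --depth=50

The heavy CPU work lands in that pass, which is exactly why
fast-import itself can afford to stay so lightweight.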
Right now fast-import is only lacking the following major features, at
which point it *should* be able to process the entire Mozilla
repository, assuming Jon can get his symbol closing problem in cvs2svn
solved:

- reload a branch which has been swapped out (this requires reading
  the pack that is currently being constructed back into memory;
  since it's not really a valid pack yet and lacks an index, reusing
  the code in sha1_file.c is slightly interesting);

- generate a tag object into the pack;

- track the association of ID numbers to SHA1 IDs (needed to properly
  create a branch or a tag from a prior commit, as the prior commit's
  SHA1 ID isn't known to the front end).

Working on the multiple-objects-in-one-zlib-stream code over the
weekend also showed me how to use sha1_file.c to do the first of the
three (I think), and the remaining two are trivial.  So my weekend
wasn't a total loss.  :-)

I plan on getting that code finished up in the next day or so and
getting a new draft of it off to Jon.  Maybe by then he'll also have
something coming out of cvs2svn's pass 9.

> > As comparison, I just tar/zipped the Mozilla CVS repo and it is 541MB.
> > The 295MB git pack number does not have commits and trees in it, it is
> > revisions only.
>
> Running git-repack -a -f from a recent GIT on the Mozilla repo converted
> through cvsps and friends produces a pack smaller than 500MB.  I even
> brought it down to 430MB by using non default delta window and depth.

I'd still like to find a way to beat that 430 MB current
best-yet-still-valid size.  :-)  I think we'll get there.

Shaving off 4-5% with a pack-specific dictionary may be worthwhile on
such large packs, especially ones that are considered "historical" and
probably will never get repacked again.  Folding objects into a single
zlib stream may also be worth it, but for now I'm going to wait and
see what mileage we can get from the existing repack code and the
existing pack file format.

--
Shawn.