Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
> I'm working on a git backend for cvs2svn and had an idea for
> git-fast-import that would make life a tiny bit easier:

Cool!

> Currently, git-fast-import marks are positive integers. But they are
> used for two things: marking single-file blobs, and marking commits.
>
> This is a tiny bit awkward, because cvs2svn assigns small integer IDs to
> these things too, but uses distinct (overlapping) integer series for the
> two concepts. If it would be trivial to split the marks into two
> "namespaces" (one for single-file blobs and one for commits), that would
> make things a little bit more natural. I don't think commit marks can
> be used interchangeably with blob marks anyway, so it wouldn't be a
> backwards incompatibility.

That's true, they aren't interchangeable. fast-import pukes and dies
if you try to use the wrong type at the wrong location. It has been
requested before that the two namespaces be split, and I have just
been too lazy to do it.

> Without this feature, I will have to assign a new "mark" integer series
> that is unrelated to cvs2svn's IDs, which is no big deal at all but will
> make debugging a little bit harder. So only add this feature if it is
> really easy for you.

It's not that much code reorg, but there is some reorg required to
make it work. Maybe only a few-hundred-line diff, so probably well
within reason. I'll look into it later.

> Also, is there a big cost to using "not-quite-consecutive" integers as
> marks? cvs2svn's CVSRevision IDs are intermingled with IDs for
> CVSBranches and CVSTags, so the CVSRevisions alone probably only pack
> the ID space 5%-50% full.

Marks cost exactly 1 pointer (4 or 8 bytes) as they are actually
just a pointer to the already-in-memory object metadata that
fast-import uses for bookkeeping related to packfile generation.
Gaps in the marks sequence also cost exactly 1 pointer, as they are
just NULL.
But the marks table is actually a sparse array, using 1024 entries
per block. So if you assign a mark at :5, then another at say
:1047000, you have only allocated 3 blocks and 12 KiB of memory
(a root directory block at 4 KiB, two leaves at 4 KiB each). A far
cry from 4 MiB.

It's not a binary tree, it's a sparse digital index. So going really
far out in the namespace with huge gaps will cost you some index
nodes. Staying reasonably dense is actually quite efficient, with
pretty low directory overheads.

> In fact, if there is a big cost to "not-quite-consecutive" integers,
> then I withdraw my request for separate mark namespaces, since I would
> have to reallocate mark numbers anyway :-)

See above. 5% full is really bad, because you are probably going to
allocate nearly every block in the directory, and only fill each leaf
block at 5% full. 50% full is actually reasonable, as it means marks
are only costing you about 2 pointers on average (8 or 16 bytes).

I went with the sparse array/digital index approach because it is
fairly compact code with quick store and lookup operations, and I
figured most frontends could get at least 50% full on their mark
allocation. On really dense allocations (>60%) the very low overhead
per mark makes it insanely efficient, even for a very large number
of marks.

Jon Smirl was dumping marks sequentially from his hacked cvs2svn,
thereby getting the marks table at 100% full. Other recent import
attempts with fast-import have also managed to keep their mark
allocations pretty close to (if not dead on) 100% full.

I can see how it might be convenient to have a very sparsely filled
mark namespace. It's also convenient to have a mark namespace that
uses arbitrary strings. Unfortunately I chose not to support those
very well (or at all!) for the sake of trying to keep the fast-import
code more compact internally, and to simplify its internal memory
management. You might be able to talk me into improving on that
however.
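
The block arithmetic above can be checked with a small sketch. This
is not fast-import's actual C code; it is a simplified two-level
table in Python, assuming a single root directory block of 1024
entries that points at leaf blocks of 1024 entries each, and 4-byte
pointers. The names SparseMarks, set_mark, get_mark, block_count and
bytes_used are made up for illustration.

```python
BLOCK = 1024  # entries per block, as stated in the text


class SparseMarks:
    """Hypothetical two-level sparse table: one root block of leaf blocks."""

    def __init__(self):
        # The root directory is always allocated; leaves are created lazily.
        self.root = [None] * BLOCK

    def set_mark(self, mark, obj):
        hi, lo = divmod(mark, BLOCK)
        if mark < 1 or hi >= BLOCK:
            raise ValueError("mark out of range for this two-level sketch")
        if self.root[hi] is None:
            self.root[hi] = [None] * BLOCK  # allocate the leaf on first use
        self.root[hi][lo] = obj

    def get_mark(self, mark):
        hi, lo = divmod(mark, BLOCK)
        if mark < 1 or hi >= BLOCK:
            return None
        leaf = self.root[hi]
        return None if leaf is None else leaf[lo]  # gaps are just None

    def block_count(self):
        # One root block plus every leaf that has been allocated.
        return 1 + sum(leaf is not None for leaf in self.root)

    def bytes_used(self, pointer_size=4):
        # Every block holds BLOCK pointers, used or not.
        return self.block_count() * BLOCK * pointer_size


marks = SparseMarks()
marks.set_mark(5, "blob-a")
marks.set_mark(1047000, "commit-b")
print(marks.block_count(), marks.bytes_used() // 1024)  # 3 12
```

With marks at :5 and :1047000 this allocates one root and two leaves,
so 3 blocks and 12 KiB with 4-byte pointers, matching the figures
above. A flat array covering the same range would need roughly 4 MiB.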
;-)

> Another thing that might help with debugging would be a "comment"
> command, which git-fast-import should ignore. One could put text about
> the source of a chunk of git-fast-import stream to relate it back to the
> front-end concepts when debugging the stream contents by hand.

This is an awesome idea, especially when combined with having a
buffer of the last few commands that fast-import saw right before
it crashed. I'll see what I can do.

--
Shawn.