W dniu 2016-06-30 o 00:00, Jeff King pisze: > On Wed, Jun 29, 2016 at 11:49:35PM +0200, Jakub Narębski wrote: > >>> So this is the ideal case for generation numbers (the worst cases are >>> when the things you are looking for are in branchy, close history where >>> the generation numbers don't tell you much; but in such cases the >>> walking is usually not too bad). >> >> There are other approaches (special indices) that help reachability >> queries beside "generation number". > > Yes, though generation numbers can help with more questions (e.g., "what > is the merge base"). Well, those special indices usually need generation number anyway. For example FELINE indices^1 (from "Reachability Queries in Very Large Graphs", http://openproceedings.org/EDBT/2014/paper_166.pdf, that I found when trying to find more about compressed bitmap indices used by Git) also utilize generation numbers to speed up reachability analysis. The idea behind FELINE is to associate every vertex (commit) with a unique ordered pair of natural integers (i.e. two more numbers, in addition to the level aka generation number), so that partial order over N^2 is superset of partial order of graph (DAG of revisions). It can answer in O(1) that nodes are not connected, and cut search space if they are. BTW. some references in this paper ("Related work" section) use PWAH compression scheme - perhaps it would work with EWAH that Git uses? 1. FELINE = Fast rEfined onLINE search ... or fun with acronyms >> By the way, what should happen if you add a replacement (in the git-replace >> meaning) that creates a shortcut, therefore invalidating generation numbers, >> at least in strict sense - committerdate as generation number would be still >> good, I think? > > This is one of the open questions. My older patches turned them off when > replacements and grafts are in effect. Well, if you store the cache of generation numbers in the packfile, or in the index of the packfile, or in the bitmap file, or in separate bitmap-like file, generating them on repack, then of course any grafts or replacements invalidate them... though for low level commands (like object counting) replacements are transparent -- or rather they are (and can be) treated as any other ref for reachability analysis. Well, if there are no grafts, you could still use them for doing "git --no-replace-objects log ...", isn't it? SIDENOTE: should grafts be deprecated and later removed, now that Git has superior alternative in the form of git-replace, and a simple way to convert grafts to replacements? Anyway, if file with mapping from revisions to its generation numbers were stored outside of packfiles, it could simply be updated / rewritten when adding new replacement. You could use one cache for no-replace, and one for replace. Though I do wonder if it would be possible to do such rewrite fast... well, fast enough assuming that adding replacements is rare. Answering my own question: >> committerdate as generation number would be still good, I think? No, it wouldn't: pre replace: 1 <-- 2 <-- 3 <-- 5 <-- \ \----- 4 <-- 6 <-- 7 <-- 8 post replace 1 <-- 2 \-- 3' <-- 5 \ \ \------ 4 <-- 6 <-- 7 <-- 8 <--\ Either the replacement commit would have generation number correct, but its child wouldn't, or its child would have correct generation number but the replacement wouldn't. >>> I have patches that generate and store the numbers at pack time, similar >>> to the way we do the reachability bitmaps. Ah, so those cached generation numbers are generated and stored at pack time. Where you store them: is it a separate file? Bitmap file? Packfile? >>> They're not production ready, >>> but they could probably be made so without too much effort. You wouldn't >>> have ready-made generation numbers for commits since the last full >>> repack, but you can compute them incrementally based on what you do have >>> at a cost linear to the unpacked commits (this is the same for bitmaps). >> >> Do Git use EWAH / EWOK bitmaps for reachability analysis, or is it still >> limited to object counting? > > At GitHub we are using them for --contains analysis, along with mass > ahead/behind (e.g., as in https://github.com/gitster/git/branches). My > plan is to send patches upstream, but they need some cleanup first. That would be nice to have, please. Er, is mass ahead/behind something that can be plugged into Git (e.g. for "git branch -v -v"), or is it something GitHub-specific? P.S. Having Git ensure that committerdate (as an epoch) is greater than committerdates of its parents at the commit creation time (with providing warning about time skew, perhaps not doing it if skew is too large) would be not costly. No need to add anything, and it would help future queries to be faster, I think. -- Jakub Narębski -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html