Re: topological index field for commit objects

Jakub Narębski <jnareb@xxxxxxxxx> · Thu, 30 Jun 2016 12:30:31 +0200

W dniu 2016-06-30 o 00:00, Jeff King pisze:
> On Wed, Jun 29, 2016 at 11:49:35PM +0200, Jakub Narębski wrote:
> 
>>> So this is the ideal case for generation numbers (the worst cases are
>>> when the things you are looking for are in branchy, close history where
>>> the generation numbers don't tell you much; but in such cases the
>>> walking is usually not too bad).
>>
>> There are other approaches (special indices) that help reachability
>> queries beside "generation number".
> 
> Yes, though generation numbers can help with more questions (e.g., "what
> is the merge base").

Well, those special indices usually need generation number anyway. For
example FELINE indices^1 (from "Reachability Queries in Very Large Graphs",
http://openproceedings.org/EDBT/2014/paper_166.pdf, that I found when
trying to find more about compressed bitmap indices used by Git) also
utilize generation numbers to speed up reachability analysis. The idea
behind FELINE is to associate every vertex (commit) with a unique ordered
pair of natural integers (i.e. two more numbers, in addition to the level
aka generation number), so that partial order over N^2 is superset of
partial order of graph (DAG of revisions). It can answer in O(1) that
nodes are not connected, and cut search space if they are.

BTW. some references in this paper ("Related work" section) use PWAH
compression scheme - perhaps it would work with EWAH that Git uses?

1. FELINE = Fast rEfined onLINE search ... or fun with acronyms

>> By the way, what should happen if you add a replacement (in the git-replace
>> meaning) that creates a shortcut, therefore invalidating generation numbers,
>> at least in strict sense - committerdate as generation number would be still
>> good, I think?
> 
> This is one of the open questions. My older patches turned them off when
> replacements and grafts are in effect.

Well, if you store the cache of generation numbers in the packfile, or in
the index of the packfile, or in the bitmap file, or in separate bitmap-like
file, generating them on repack, then of course any grafts or replacements
invalidate them... though for low level commands (like object counting)
replacements are transparent -- or rather they are (and can be) treated as
any other ref for reachability analysis.

Well, if there are no grafts, you could still use them for doing
"git --no-replace-objects log ...", isn't it?

SIDENOTE: should grafts be deprecated and later removed, now that Git has
superior alternative in the form of git-replace, and a simple way to convert
grafts to replacements?

Anyway, if file with mapping from revisions to its generation numbers
were stored outside of packfiles, it could simply be updated / rewritten
when adding new replacement. You could use one cache for no-replace,
and one for replace. Though I do wonder if it would be possible to do
such rewrite fast... well, fast enough assuming that adding replacements
is rare.

Answering my own question:
>> committerdate as generation number would be still good, I think?
No, it wouldn't:

  pre replace:

    1 <-- 2 <-- 3 <-- 5 <--
           \
            \----- 4 <-- 6 <-- 7 <-- 8

  post replace

   1 <-- 2                              \-- 3' <-- 5
          \                              \
           \------ 4 <-- 6 <-- 7 <-- 8 <--\

Either the replacement commit would have generation number correct, but
its child wouldn't, or its child would have correct generation number but
the replacement wouldn't.

>>> I have patches that generate and store the numbers at pack time, similar
>>> to the way we do the reachability bitmaps.

Ah, so those cached generation numbers are generated and stored at pack
time. Where you store them: is it a separate file? Bitmap file? Packfile?

>>>                                            They're not production ready,
>>> but they could probably be made so without too much effort. You wouldn't
>>> have ready-made generation numbers for commits since the last full
>>> repack, but you can compute them incrementally based on what you do have
>>> at a cost linear to the unpacked commits (this is the same for bitmaps).
>>
>> Do Git use EWAH / EWOK bitmaps for reachability analysis, or is it still
>> limited to object counting?
> 
> At GitHub we are using them for --contains analysis, along with mass
> ahead/behind (e.g., as in https://github.com/gitster/git/branches). My
> plan is to send patches upstream, but they need some cleanup first.

That would be nice to have, please.

Er, is mass ahead/behind something that can be plugged into Git
(e.g. for "git branch -v -v"), or is it something GitHub-specific?

P.S. Having Git ensure that committerdate (as an epoch) is greater
than committerdates of its parents at the commit creation time (with
providing warning about time skew, perhaps not doing it if skew is
too large) would be not costly. No need to add anything, and it would
help future queries to be faster, I think.

-- 
Jakub Narębski

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html