On 2/12/2018 3:37 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@xxxxxxxxx> writes:
>> Derrick Stolee <stolee@xxxxxxxxx> writes:
>>> It is possible to have multiple commit graph files in a pack directory,
>>> but only one is important at a time. Use a 'graph_head' file to point
>>> to the important file. Teach git-commit-graph to write 'graph_head' upon
>>> writing a new commit graph file.
>> Why this design, instead of what "repack -a" would do, iow, if there
>> always is a singleton that is the only one that matters, shouldn't
>> the creation of that latest singleton just clear the older ones
>> before it returns control?
> Note that I am not complaining---I am just curious why we want to
> expose this "there is one relevant one but we keep irrelevant ones
> we usually do not look at and need to be garbage collected" to end
> users, and also expect readers of the series, resulting code and
> docs would have the same puzzled feeling.
Aside: I forgot to mention in my cover letter that the behavior of the
"--delete-expired" flag for "git commit-graph write" is different from
v2. If specified, we delete all ".graph" files in the pack directory
other than the one referenced by "graph_head" at the beginning of the
process or the one written by the process. If these deletes fail, we
ignore the failure (assuming the files are still in use by another Git
process); in the usual case, the next run will delete the expired
files. I believe this matches similar behavior in gc and repack.
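To make the expire step concrete, here is a rough sketch in plain POSIX
C. This is not the code from the series; the helper names and the
fixed-size path buffers are made up for illustration, and error
handling is reduced to "skip and move on":

/*
 * Illustrative sketch only (not the code from the series): expire old
 * commit graph files.  "keep_old" is the file graph_head pointed to
 * before the write and "keep_new" is the file we just wrote; both are
 * bare file names within pack_dir.  Failed deletions are ignored, since
 * another process may still have the file open; a later run will
 * remove it.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int has_graph_suffix(const char *name)
{
	size_t len = strlen(name);
	return len > 6 && !strcmp(name + len - 6, ".graph");
}

static void delete_expired_graphs(const char *pack_dir,
				  const char *keep_old,
				  const char *keep_new)
{
	DIR *dir = opendir(pack_dir);
	struct dirent *de;
	char path[4096];

	if (!dir)
		return;
	while ((de = readdir(dir)) != NULL) {
		if (!has_graph_suffix(de->d_name))
			continue;
		if (!strcmp(de->d_name, keep_old) ||
		    !strcmp(de->d_name, keep_new))
			continue;
		snprintf(path, sizeof(path), "%s/%s", pack_dir, de->d_name);
		/* ignore failures; the next run will try again */
		unlink(path);
	}
	closedir(dir);
}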
-- Back to discussion about the value of "graph_head" --
The current design of using a pointer file (graph_head) is intended to
have these benefits:
1. We do not need to rely on a directory listing and mtimes to determine
which graph file to use.
2. If we write a new graph file while another git process is reading the
existing graph file, we can update the graph_head pointer without
deleting the file that is currently memory-mapped. (This is why we
cannot just rely on a canonical file name, such as "the_graph", to store
the data.)
3. We can atomically change the 'graph_head' file without interrupting
concurrent git processes (see the sketch after this list). I think this
is different from the "repack" situation because a concurrent process
would load all packfiles in the pack directory and possibly have open
handles when the repack is trying to delete them.
4. We remain open to making the graph file incremental (as the MIDX
feature is designed to do; see [1]). It is less crucial to have an
incremental graph file structure (the graph file for the Windows
repository is currently ~120MB versus a MIDX file of 1.25 GB), but the
graph_head pattern makes this a possibility.
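As promised above, here is a rough sketch of the pointer update behind
items 2 and 3, in plain POSIX C. This is not the code from the series;
the "graph_head.lock" name, the helper name, and the flat path buffers
are made up for illustration, and the real code would presumably go
through Git's lockfile API rather than raw open()/rename():

/*
 * Illustrative sketch only: atomically repoint graph_head at a newly
 * written graph file.  Writing the new name to a temporary file and
 * rename()ing it into place is atomic on POSIX filesystems, so a
 * concurrent reader sees either the old pointer or the new one, and
 * the old .graph file itself is left alone for anyone who still has
 * it memory-mapped.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int update_graph_head(const char *pack_dir, const char *graph_name)
{
	char head[4096], lock[4096];
	int fd;

	snprintf(head, sizeof(head), "%s/graph_head", pack_dir);
	snprintf(lock, sizeof(lock), "%s/graph_head.lock", pack_dir);

	fd = open(lock, O_WRONLY | O_CREAT | O_EXCL, 0666);
	if (fd < 0)
		return -1; /* another writer holds the lock */
	if (write(fd, graph_name, strlen(graph_name)) < 0 ||
	    write(fd, "\n", 1) < 0) {
		close(fd);
		unlink(lock);
		return -1;
	}
	close(fd);
	return rename(lock, head); /* atomic swap of the pointer file */
}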
I tried to avoid relying on a directory listing (item 1) partly due to
personal taste, and partly because I am storing the files in the
objects/pack directory, so the listing may be very large with a lot of
wasted entries. That concern becomes less important with our pending
change to move the graph files to a different directory. An mtime-based
listing would also satisfy items 2 and 3, as long as we never write
graph files so quickly that two of them collide on mtime.
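For comparison, the directory-listing alternative I am describing would
look roughly like this (again illustrative POSIX C only, not code from
the series):

/*
 * Illustrative sketch only: the directory-listing alternative to
 * graph_head.  Pick the "*.graph" file in pack_dir with the newest
 * mtime.  Two graph files written within the same timestamp
 * granularity are indistinguishable, which is the mtime collision
 * mentioned above.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static int find_newest_graph(const char *pack_dir, char *out, size_t outlen)
{
	DIR *dir = opendir(pack_dir);
	struct dirent *de;
	struct stat st;
	char path[4096];
	time_t newest = 0;
	int found = 0;

	if (!dir)
		return 0;
	while ((de = readdir(dir)) != NULL) {
		size_t len = strlen(de->d_name);

		if (len <= 6 || strcmp(de->d_name + len - 6, ".graph"))
			continue;
		snprintf(path, sizeof(path), "%s/%s", pack_dir, de->d_name);
		if (stat(path, &st) < 0)
			continue;
		if (!found || st.st_mtime > newest) {
			newest = st.st_mtime;
			snprintf(out, outlen, "%s", path);
			found = 1;
		}
	}
	closedir(dir);
	return found;
}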
I cannot think of another design that satisfies item 4.
As for your end user concerns: My philosophy with this feature is that
end users will never interact with the commit-graph builtin. 99% of
users will benefit from a repack or GC automatically computing a commit
graph (when we add that integration point). The other uses of the
builtin are in environments that want extreme control over their data,
such as code servers and build agents.
Perhaps someone with experience managing large repositories with git in
a server environment could chime in with some design requirements they
would need.
Thanks,
-Stolee
[1] https://public-inbox.org/git/20180107181459.222909-2-dstolee@xxxxxxxxxxxxx/
    [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes