Thanks for this series, and sorry for the delayed (and partial) review. I hope comments here are still useful. I don't have any comments on the larger design that other people haven't already made, but I do have a few small questions and suggestions, inline below. On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > > The first job to implement in the 'git run-job' command is the > 'commit-graph' job. It is based on the sequence of events in the > 'commit-graph' job in Scalar [1]. This sequence is as follows: > > 1. git commit-graph write --reachable --split > 2. git commit-graph verify --shallow > 3. If the verify succeeds, stop. > 4. Delete the commit-graph-chain file. > 5. git commit-graph write --reachable --split > > By writing an incremental commit-graph file using the "--split" > option we minimize the disruption from this operation. The default > behavior is to merge layers until the new "top" layer is less than > half the size of the layer below. This provides quick writes "most" > of the time, with the longer writes following a power law > distribution. > > Most importantly, concurrent Git processes only look at the > commit-graph-chain file for a very short amount of time, so they > will verly likely not be holding a handle to the file when we try > to replace it. (This only matters on Windows.) > > If a concurrent process reads the old commit-graph-chain file, but > our job expires some of the .graph files before they can be read, > then those processes will see a warning message (but not fail). > This could be avoided by a future update to use the --expire-time > argument when writing the commit-graph. > > By using 'git commit-graph verify --shallow' we can ensure that > the file we just wrote is valid. This is an extra safety precaution > that is faster than our 'write' subcommand. In the rare situation > that the newest layer of the commit-graph is corrupt, we can "fix" > the corruption by deleting the commit-graph-chain file and rewrite > the full commit-graph as a new one-layer commit graph. This does > not completely prevent _that_ file from being corrupt, but it does > recompute the commit-graph by parsing commits from the object > database. In our use of this step in Scalar and VFS for Git, we > have only seen this issue arise because our microsoft/git fork > reverted 43d3561 ("commit-graph write: don't die if the existing > graph is corrupt" 2019-03-25) for a while to keep commit-graph > writes very fast. We dropped the revert when updating to v2.23.0. > The verify still has potential for catching corrupt data across > the layer boundary: if the new file has commit X with parent Y > in an old file but the commit ID for Y in the old file had a > bitswap, then we will notice that in the 'verify' command. > > [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/CommitGraphStep.cs > > RFC QUESTIONS: > > 1. Are links like [1] helpful? Yes, I appreciate being able to reference these. However, given that these will show up in commit logs for ~forever, and ongoing work might happen on Scalar, perhaps the links should be pinned to a specific revision? > > 2. Can anyone think of a way to test the rewrite fallback? > It requires corrupting the latest file between two steps of > this one command, so that is a tricky spot to inject a fault. > > Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > --- > Documentation/git-run-job.txt | 21 ++++++++++-- > builtin/run-job.c | 60 ++++++++++++++++++++++++++++++++++- > commit-graph.c | 2 +- > commit-graph.h | 1 + > t/t7900-run-job.sh | 32 +++++++++++++++++++ > 5 files changed, 112 insertions(+), 4 deletions(-) > > diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt > index 0627b3ed259..8bf2762d650 100644 > --- a/Documentation/git-run-job.txt > +++ b/Documentation/git-run-job.txt > @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation. > SYNOPSIS > -------- > [verse] > -'git run-job <job-name>' > +'git run-job commit-graph' > > > DESCRIPTION > @@ -28,7 +28,24 @@ BEHAVIOR MAY BE ALTERED IN THE FUTURE. > JOBS > ---- > > -TBD > +'commit-graph':: > + > +The `commit-graph` job updates the `commit-graph` files incrementally, > +then verifies that the written data is correct. If the new layer has an > +issue, then the chain file is removed and the `commit-graph` is > +rewritten from scratch. > ++ > +The verification only checks the top layer of the `commit-graph` chain. > +If the incremental write merged the new commits with at least one > +existing layer, then there is potential for on-disk corruption being > +carried forward into the new file. This will be noticed and the new > +commit-graph file will be clean as Git reparses the commit data from > +the object database. > ++ > +The incremental write is safe to run alongside concurrent Git processes > +since it will not expire `.graph` files that were in the previous > +`commit-graph-chain` file. They will be deleted by a later run based on > +the expiration delay. > > > GIT > diff --git a/builtin/run-job.c b/builtin/run-job.c > index 2c78d053aa4..dd7709952d3 100644 > --- a/builtin/run-job.c > +++ b/builtin/run-job.c > @@ -1,12 +1,65 @@ > #include "builtin.h" > #include "config.h" > +#include "commit-graph.h" > +#include "object-store.h" > #include "parse-options.h" > +#include "repository.h" > +#include "run-command.h" > > static char const * const builtin_run_job_usage[] = { > - N_("git run-job"), > + N_("git run-job commit-graph"), > NULL > }; > > +static int run_write_commit_graph(void) > +{ > + struct argv_array cmd = ARGV_ARRAY_INIT; > + > + argv_array_pushl(&cmd, "commit-graph", "write", > + "--split", "--reachable", "--no-progress", > + NULL); > + > + return run_command_v_opt(cmd.argv, RUN_GIT_CMD); > +} > + > +static int run_verify_commit_graph(void) > +{ > + struct argv_array cmd = ARGV_ARRAY_INIT; > + > + argv_array_pushl(&cmd, "commit-graph", "verify", > + "--shallow", "--no-progress", NULL); > + > + return run_command_v_opt(cmd.argv, RUN_GIT_CMD); > +} > + > +static int run_commit_graph_job(void) > +{ > + char *chain_path; > + > + if (run_write_commit_graph()) { > + error(_("failed to write commit-graph")); > + return 1; > + } > + > + if (!run_verify_commit_graph()) > + return 0; > + > + warning(_("commit-graph verify caught error, rewriting")); > + > + chain_path = get_chain_filename(the_repository->objects->odb); Should we avoid new uses of `the_repository` and take a repo pointer argument here instead? > + if (unlink(chain_path)) { > + UNLEAK(chain_path); > + die(_("failed to remove commit-graph at %s"), chain_path); > + } > + free(chain_path); > + > + if (!run_write_commit_graph()) > + return 0; Is there a reason why we don't verify the second write attempt? > + > + error(_("failed to rewrite commit-graph")); > + return 1; > +} > + > int cmd_run_job(int argc, const char **argv, const char *prefix) > { > static struct option builtin_run_job_options[] = { > @@ -23,6 +76,11 @@ int cmd_run_job(int argc, const char **argv, const char *prefix) > builtin_run_job_usage, > PARSE_OPT_KEEP_UNKNOWN); > > + if (argc > 0) { > + if (!strcmp(argv[0], "commit-graph")) > + return run_commit_graph_job(); > + } > + > usage_with_options(builtin_run_job_usage, > builtin_run_job_options); > } > diff --git a/commit-graph.c b/commit-graph.c > index f013a84e294..6867f92d04a 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -56,7 +56,7 @@ static char *get_split_graph_filename(struct object_directory *odb, > oid_hex); > } > > -static char *get_chain_filename(struct object_directory *odb) > +char *get_chain_filename(struct object_directory *odb) > { > return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path); > } > diff --git a/commit-graph.h b/commit-graph.h > index e87a6f63600..8b7bd5a6cb1 100644 > --- a/commit-graph.h > +++ b/commit-graph.h > @@ -13,6 +13,7 @@ > struct commit; > > char *get_commit_graph_filename(struct object_directory *odb); > +char *get_chain_filename(struct object_directory *odb); > int open_commit_graph(const char *graph_file, int *fd, struct stat *st); > > /* > diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh > index 1eac80b7ed3..18b9bd26b3a 100755 > --- a/t/t7900-run-job.sh > +++ b/t/t7900-run-job.sh > @@ -5,6 +5,8 @@ test_description='git run-job > Testing the background jobs, in the foreground > ' > > +GIT_TEST_COMMIT_GRAPH=0 > + > . ./test-lib.sh > > test_expect_success 'help text' ' > @@ -12,4 +14,34 @@ test_expect_success 'help text' ' > test_i18ngrep "usage: git run-job" err > ' > > +test_expect_success 'commit-graph job' ' > + git init server && > + ( > + cd server && > + chain=.git/objects/info/commit-graphs/commit-graph-chain && > + git checkout -b master && > + for i in $(test_seq 1 15) > + do > + test_commit $i || return 1 > + done && > + git run-job commit-graph 2>../err && > + test_must_be_empty ../err && > + test_line_count = 1 $chain && > + for i in $(test_seq 16 19) > + do > + test_commit $i || return 1 > + done && > + git run-job commit-graph 2>../err && > + test_must_be_empty ../err && > + test_line_count = 2 $chain && > + for i in $(test_seq 20 23) > + do > + test_commit $i || return 1 > + done && > + git run-job commit-graph 2>../err && > + test_must_be_empty ../err && > + test_line_count = 1 $chain > + ) > +' > + > test_done > -- > gitgitgadget >