Derrick Stolee <stolee@xxxxxxxxx> writes: > From: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > Subject: [PATCH v8 08/14] commit-graph: implement git commit-graph read Minor nit: this is one commit message [subject] among all others that uses "git commit-graph" instead of "git-commit-graph" in the description. Also, perhaps this (and similarly titled commits in this series) would read better with quotes, that is as: commit-graph: implement "git commit-graph read" Though that might be a matter of personal taste. > > Teach git-commit-graph to read commit graph files and summarize their contents. > > Use the read subcommand to verify the contents of a commit graph file in the > tests. Better would be, in my opinion Use the 'read' subcommand or Use the "read" subcommand > > Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > --- > Documentation/git-commit-graph.txt | 12 +++ > builtin/commit-graph.c | 56 ++++++++++++ > commit-graph.c | 137 ++++++++++++++++++++++++++++- > commit-graph.h | 23 +++++ > t/t5318-commit-graph.sh | 32 +++++-- > 5 files changed, 254 insertions(+), 6 deletions(-) > > diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt > index 47996e8f89..8aad8303f5 100644 > --- a/Documentation/git-commit-graph.txt > +++ b/Documentation/git-commit-graph.txt > @@ -9,6 +9,7 @@ git-commit-graph - Write and verify Git commit graph files > SYNOPSIS > -------- > [verse] > +'git commit-graph read' [--object-dir <dir>] > 'git commit-graph write' <options> [--object-dir <dir>] Why do you need this '[--object-dir <dir>]' parameter? Anyway, because Git has the GIT_OBJECT_DIRECTORY environment variable support, I would expect '--object-dir' to be parameter to the 'git' wrapper/command, like '--git-dir' is, not to the 'git commit-graph' command, or even only its selected individual subcommands. > > > @@ -35,6 +36,11 @@ COMMANDS > Write a commit graph file based on the commits found in packfiles. > Includes all commits from the existing commit graph file. > > +'read':: > + > +Read a graph file given by the commit-graph file The above part of sentence reads very strange, as a truism. > and output basic > +details about the graph file. Used for debugging purposes. I would say that it is 'used' for testing, and is 'useful' (or 'can be used') for debugging purposes. > + > > EXAMPLES > -------- > @@ -45,6 +51,12 @@ EXAMPLES > $ git commit-graph write > ------------------------------------------------ > > +* Read basic information from the commit-graph file. > ++ > +------------------------------------------------ > +$ git commit-graph read > +------------------------------------------------ I would personally prefer to have example output together with example calling convention. > + > > GIT > --- > diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c > index 26b6360289..efd39331d7 100644 > --- a/builtin/commit-graph.c > +++ b/builtin/commit-graph.c > @@ -7,10 +7,16 @@ > > static char const * const builtin_commit_graph_usage[] = { > N_("git commit-graph [--object-dir <objdir>]"), > + N_("git commit-graph read [--object-dir <objdir>]"), > N_("git commit-graph write [--object-dir <objdir>]"), > NULL > }; > > +static const char * const builtin_commit_graph_read_usage[] = { > + N_("git commit-graph read [--object-dir <objdir>]"), > + NULL > +}; > + > static const char * const builtin_commit_graph_write_usage[] = { > N_("git commit-graph write [--object-dir <objdir>]"), > NULL > @@ -20,6 +26,54 @@ static struct opts_commit_graph { > const char *obj_dir; > } opts; > > +static int graph_read(int argc, const char **argv) > +{ > + struct commit_graph *graph = NULL; > + char *graph_name; > + > + static struct option builtin_commit_graph_read_options[] = { > + OPT_STRING(0, "object-dir", &opts.obj_dir, > + N_("dir"), > + N_("The object directory to store the graph")), Actually it is not the object directory to store the graph, but it is the object directory to read the commit-graph file from. > + OPT_END(), > + }; > + > + argc = parse_options(argc, argv, NULL, > + builtin_commit_graph_read_options, > + builtin_commit_graph_read_usage, 0); > + > + if (!opts.obj_dir) > + opts.obj_dir = get_object_directory(); > + > + graph_name = get_commit_graph_filename(opts.obj_dir); > + graph = load_commit_graph_one(graph_name); > + > + if (!graph) > + die("graph file %s does not exist", graph_name); It might be better to use single quotes around '%s'; this is absolute pathname (if I understand it correctly), and it may contain spaces in it. > + FREE_AND_NULL(graph_name); > + > + printf("header: %08x %d %d %d %d\n", Wouldn't it be better to print signature charactes (FourCC-like), that is 'CGPH'? And maybe name each part of header? + printf("header: %c%c%c%c ver=%d hash=%d chunks=%d reserved=%d\n", Would it make using the command in tests harder, maybe? > + ntohl(*(uint32_t*)graph->data), > + *(unsigned char*)(graph->data + 4), > + *(unsigned char*)(graph->data + 5), > + *(unsigned char*)(graph->data + 6), > + *(unsigned char*)(graph->data + 7)); > + printf("num_commits: %u\n", graph->num_commits); All right. > + printf("chunks:"); > + > + if (graph->chunk_oid_fanout) > + printf(" oid_fanout"); > + if (graph->chunk_oid_lookup) > + printf(" oid_lookup"); > + if (graph->chunk_commit_data) > + printf(" commit_metadata"); > + if (graph->chunk_large_edges) > + printf(" large_edges"); > + printf("\n"); This means that there is no support for unknown chunks (perhaps created by newer version of Git - that does not exist yet), including unknown optional chunks. But I guess that is acceptable at this stage. Note that for unknown chunks you would be able to only print their signatures, because we do not know their full names. > + > + return 0; > +} > + No unmap, no closing file descriptor; I guess we can rely on operating system doing this cleanup for us on exit. [...] > +static struct commit_graph *alloc_commit_graph(void) > +{ > + struct commit_graph *g = xcalloc(1, sizeof(*g)); All right, that is the standard idiom used by git code. > + g->graph_fd = -1; > + > + return g; > +} Would we need some safe way of deallocating graph data? Who owns graph_fd, and is responsible for closing the file (well, except system when program exits - but what about libgit2 then)? > + > +struct commit_graph *load_commit_graph_one(const char *graph_file) > +{ > + void *graph_map; > + const unsigned char *data, *chunk_lookup; > + size_t graph_size; > + struct stat st; > + uint32_t i; > + struct commit_graph *graph; > + int fd = git_open(graph_file); > + uint64_t last_chunk_offset; > + uint32_t last_chunk_id; > + uint32_t graph_signature; > + unsigned char graph_version, hash_version; > + > + if (fd < 0) > + return NULL; > + if (fstat(fd, &st)) { > + close(fd); > + return NULL; > + } > + graph_size = xsize_t(st.st_size); > + > + if (graph_size < GRAPH_MIN_SIZE) { > + close(fd); > + die("graph file %s is too small", graph_file); Should we print its expected minimal size, too? Shouldn't error messages be marked for localization? > + } > + graph_map = xmmap(NULL, graph_size, PROT_READ, MAP_PRIVATE, fd, 0); > + data = (const unsigned char *)graph_map; All right, speed is important, so let's (x)mmap the file. > + > + graph_signature = get_be32(data); > + if (graph_signature != GRAPH_SIGNATURE) { > + error("graph signature %X does not match signature %X", > + graph_signature, GRAPH_SIGNATURE); > + goto cleanup_fail; > + } All right, we check the signature of the file. > + > + graph_version = *(unsigned char*)(data + 4); I wonder if those numbers should not be replaced by preprocessor constants. I guess it wouldn't actually improve readability. > + if (graph_version != GRAPH_VERSION) { > + error("graph version %X does not match version %X", > + graph_version, GRAPH_VERSION); > + goto cleanup_fail; > + } Does this mean that the command is not forward-compatibile, in that it would fail on "commit-graph" files created in newer version of Git, then accessed with older version? > + > + hash_version = *(unsigned char*)(data + 5); > + if (hash_version != GRAPH_OID_VERSION) { > + error("hash version %X does not match version %X", > + hash_version, GRAPH_OID_VERSION); > + goto cleanup_fail; > + } All right, there is no support for NewHash yet, so there is nothing to do but fail. > + > + graph = alloc_commit_graph(); > + > + graph->hash_len = GRAPH_OID_LEN; > + graph->num_chunks = *(unsigned char*)(data + 6); > + graph->graph_fd = fd; > + graph->data = graph_map; > + graph->data_len = graph_size; > + > + last_chunk_id = 0; > + last_chunk_offset = 8; > + chunk_lookup = data + 8; > + for (i = 0; i < graph->num_chunks; i++) { > + uint32_t chunk_id = get_be32(chunk_lookup + 0); > + uint64_t chunk_offset = get_be64(chunk_lookup + 4); > + int chunk_repeated = 0; > + > + chunk_lookup += GRAPH_CHUNKLOOKUP_WIDTH; All right, here we use preprocessor constant (I would guess: 4 + 8). > + > + if (chunk_offset > graph_size - GIT_MAX_RAWSZ) { All right, there must be place for final H-byte HASH-checksum of all of contents. > + error("improper chunk offset %08x%08x", (uint32_t)(chunk_offset >> 32), And by "improper" you mean "too large" here. Why the strange formatting of uint64_t / off64_t values? Is it compatibility reasons? > + (uint32_t)chunk_offset); > + goto cleanup_fail; > + } > + > + switch (chunk_id) { > + case GRAPH_CHUNKID_OIDFANOUT: > + if (graph->chunk_oid_fanout) > + chunk_repeated = 1; > + else > + graph->chunk_oid_fanout = (uint32_t*)(data + chunk_offset); All right, this is the only currently defined chunk where the element is a simple type, it would be always the same simple type, and we know this type. Not so for the rest of chunks: either the element is composite type, or the size of element can change in the future (like hash size). Sidenote: for verification one would probably have to check that: - the size of oid_fanout chunk is 256 * 4 bytes - that 0 <= F[0] <= F[1] <= ... <= F[255] = num_commits > + break; > + > + case GRAPH_CHUNKID_OIDLOOKUP: > + if (graph->chunk_oid_lookup) > + chunk_repeated = 1; > + else > + graph->chunk_oid_lookup = data + chunk_offset; > + break; Sidenote: for verification one would probably have to check that: - the size of oid_lookup is N * H bytes, where N = num_commits - the OIDs are sorted in ascending lexicographical order - that each objects with given OID exists, and is a commit object Though the problem is that we may not know num_commits with this way of reading at this time. > + > + case GRAPH_CHUNKID_DATA: > + if (graph->chunk_commit_data) > + chunk_repeated = 1; > + else > + graph->chunk_commit_data = data + chunk_offset; > + break; Sidenote: for verification one would probably have to check that: - the size of oid_lookup is N * (H + 16) bytes, where N = num_commits - that data in here agrees with data from the ODB > + > + case GRAPH_CHUNKID_LARGEEDGES: > + if (graph->chunk_large_edges) > + chunk_repeated = 1; > + else > + graph->chunk_large_edges = data + chunk_offset; > + break; > + } Sidenote: verification of this would be even more involved. > + > + if (chunk_repeated) { > + error("chunk id %08x appears multiple times", chunk_id); Wouldn't it be better to print signature, and not raw chunk_id in hex? > + goto cleanup_fail; > + } All right, we fail on first repeated non-repeatable chunk. > + > + if (last_chunk_id == GRAPH_CHUNKID_OIDLOOKUP) > + { > + graph->num_commits = (chunk_offset - last_chunk_offset) > + / graph->hash_len; > + } All right, looks good to me. Sidenote: one should probably verify that (chunk_offset - last_chunk_offset) here is evenly divisible into hash_len. > + > + last_chunk_id = chunk_id; > + last_chunk_offset = chunk_offset; > + } Sidenote: the verification should check that final checksum is correct. > + > + return graph; > + > +cleanup_fail: > + munmap(graph_map, graph_size); > + close(fd); > + exit(1); > +} > + > static void write_graph_chunk_fanout(struct hashfile *f, > struct commit **commits, > int nr_commits) > diff --git a/commit-graph.h b/commit-graph.h > index 16fea993ab..2528478f06 100644 > --- a/commit-graph.h > +++ b/commit-graph.h > @@ -1,6 +1,29 @@ > #ifndef COMMIT_GRAPH_H > #define COMMIT_GRAPH_H > > +#include "git-compat-util.h" > + > +char *get_commit_graph_filename(const char *obj_dir); > + > +struct commit_graph { > + int graph_fd; > + > + const unsigned char *data; > + size_t data_len; All right, this is "raw data". > + > + unsigned char hash_len; > + unsigned char num_chunks; > + uint32_t num_commits; All right. > + struct object_id oid; What is this for? > + > + const uint32_t *chunk_oid_fanout; > + const unsigned char *chunk_oid_lookup; > + const unsigned char *chunk_commit_data; > + const unsigned char *chunk_large_edges; All right, individual chunks (or NULL if chunks does not exist - for optional ones). > +}; > + > +struct commit_graph *load_commit_graph_one(const char *graph_file); > + > void write_commit_graph(const char *obj_dir); > > #endif > diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh > index d7b635bd68..2f44f91193 100755 > --- a/t/t5318-commit-graph.sh > +++ b/t/t5318-commit-graph.sh > @@ -26,10 +26,28 @@ test_expect_success 'create commits and repack' ' > git repack > ' > > +graph_read_expect() { All right, I see that we have unstated convention of not documenting local shell functions in tests. This should have space before parentheses, like this: +graph_read_expect () { ^ \-- here > + OPTIONAL="" > + NUM_CHUNKS=3 > + if test ! -z $2 > + then > + OPTIONAL=" $2" > + NUM_CHUNKS=$((3 + $(echo "$2" | wc -w))) > + fi I don't know if it is possible to do the above in a portable shell without using external 'wc' command. Also, isn't $(( ... )) bashism? Perhaps better solution would be to pass each expected extra chunk as separate parameter, and simply compose OPTIONAL from those subsequent parameters: we know that the separator is space. Also, currently this is overengineered a bit... or just forward-thinking, as we will have at most single-word 2nd parameter, namely "large_edges". > + cat >expect <<- EOF > + header: 43475048 1 1 $NUM_CHUNKS 0 > + num_commits: $1 > + chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL > + EOF > + git commit-graph read >output && > + test_cmp expect output > +} > + > test_expect_success 'write graph' ' > cd "$TRASH_DIRECTORY/full" && > graph1=$(git commit-graph write) && Why do you use command substitution here? 'graph1' variable is not used anywhere I can see, and in all other examples below you simply run "git commit-graph write" without command substitution. > - test_path_is_file $objdir/info/commit-graph > + test_path_is_file $objdir/info/commit-graph && > + graph_read_expect "3" > ' > > test_expect_success 'Add more commits' ' > @@ -72,7 +90,8 @@ test_expect_success 'Add more commits' ' > test_expect_success 'write graph with merges' ' > cd "$TRASH_DIRECTORY/full" && > git commit-graph write && > - test_path_is_file $objdir/info/commit-graph > + test_path_is_file $objdir/info/commit-graph && > + graph_read_expect "10" "large_edges" > ' > > test_expect_success 'Add one more commit' ' [...] Thank you for your patient work on this feature, -- Jakub Narębski