Re: [PATCH v4 05/10] commit-graph: always load commit-graph information

On 4/29/2018 6:14 PM, Jakub Narebski wrote:
Derrick Stolee <dstolee@xxxxxxxxxxxxx> writes:

Most code paths load commits using lookup_commit() and then
parse_commit().
And this automatically loads commit graph if needed, thanks to changes
in parse_commit_gently(), which parse_commit() uses.

                 In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().
I guess the problem is that we cannot just add parse_commit_in_graph()
like we did in parse_commit_gently(), for some reason?  Like for example
that parse_commit_gently() uses parse_commit_buffer() - which could have
been handled by moving parse_commit_in_graph() down the call chain from
parse_commit_gently() to parse_commit_buffer()... if not for the fact that
check_commit() also uses parse_commit_buffer(), but it does not want to
load commit graph.  Am I right?

If a caller uses parse_commit_buffer() directly, then we will guarantee that all values in the struct commit that would be loaded from the buffer are loaded from the buffer. This means we do NOT load the root tree id or commit date from the commit-graph file. We do still need to load the data that is not available in the buffer, such as graph_pos and generation.
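
To make the guarantee above concrete, here is a minimal toy model in C. It is not Git's code: the toy_* types, the TOY_* constants, and passing the graph row in as a pointer are all assumptions made for illustration. It only shows the rule that buffer-derived fields (tree id, date) always come from the buffer, while graph_pos and generation are filled from the graph when check_graph is set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NOT_FROM_GRAPH 0xFFFFFFFFu /* stand-in for COMMIT_NOT_FROM_GRAPH */
#define TOY_GENERATION_UNKNOWN 0u      /* hypothetical "not loaded" value */

struct toy_commit {
	uint32_t tree_id;
	uint64_t date;
	uint32_t graph_pos;
	uint32_t generation;
};

/* What the commit object buffer says (hypothetical, already decoded). */
struct toy_buffer {
	uint32_t tree_id;
	uint64_t date;
};

/* What the commit-graph says about the same commit (hypothetical). */
struct toy_graph_row {
	uint32_t pos;
	uint32_t tree_id;
	uint64_t date;
	uint32_t generation;
};

static void toy_parse_commit_buffer(struct toy_commit *item,
				    const struct toy_buffer *buf,
				    const struct toy_graph_row *row,
				    int check_graph)
{
	assert(item && buf);
	/* Everything the buffer can provide is taken from the buffer. */
	item->tree_id = buf->tree_id;
	item->date = buf->date;
	/* Only data the buffer cannot provide is taken from the graph. */
	if (check_graph && row) {
		item->graph_pos = row->pos;
		item->generation = row->generation;
	}
}

/* Parse one commit whose graph row deliberately disagrees with the buffer. */
static struct toy_commit toy_demo(int check_graph)
{
	const struct toy_buffer buf = { 7, 1525000000 };
	const struct toy_graph_row row = { 3, 999, 42, 5 };
	struct toy_commit c = { 0, 0, TOY_NOT_FROM_GRAPH, TOY_GENERATION_UNKNOWN };

	toy_parse_commit_buffer(&c, &buf, &row, check_graph);
	return c;
}
```

With check_graph set, toy_demo() keeps the buffer's tree id and date even though the graph row holds different values, and picks up only the position and generation from the row.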


With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.
Is it generation number, or generation number and position in commit
graph?

We don't need to ensure the graph_pos (the commit will never be re-parsed, so we will not try to find it in the commit-graph file again), but we DO need to ensure the generation (or our commit walks will be incorrect). We get the graph_pos as a side-effect.


Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter.
I think this commit would be easier to review if it was split into pure
refactoring part (extracting fill_commit_graph_info() and
find_commit_in_graph()).  On the other hand the refactoring was needed
to reduce code duplication between existing parse_commit_in_graph() and
new load_commit_graph_info() functions.

I guess that the difference between parse_commit_in_graph() and
load_commit_graph_info() is that the former cares only about having just
enough information that is needed for parse_commit_gently() - and does
not load graph data if commit is parsed, while the latter is about
loading commit-graph data like generation numbers.

Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
---
  commit-graph.c | 45 ++++++++++++++++++++++++++++++---------------
  commit-graph.h |  8 ++++++++
  commit.c       |  7 +++++--
  commit.h       |  2 +-
  object.c       |  2 +-
  sha1_file.c    |  2 +-
  6 files changed, 46 insertions(+), 20 deletions(-)
I wonder if it would be possible to add tests for this feature, for
example that commit-graph is read when it should be (including those
branch lookups), and is not read when the feature is disabled.

But the only way to test it I can think of is a stupid one: create
invalid commit graph, and check that git fails as expected (trying to
read said malformed file), and does not fail if commit graph feature is
disabled.

Let me reorder files (BTW, is there a way for Git to put *.h files
before *.c files in diff?) for easier review:

diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
   */
  int parse_commit_in_graph(struct commit *item);
+/*
+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
  struct tree *get_commit_tree_in_graph(const struct commit *c);
  struct commit_graph {
diff --git a/commit-graph.c b/commit-graph.c
index 047fa9fca5..aebd242def 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct commit_graph *g,
  	return &commit_list_insert(c, pptr)->next;
  }
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)
+{
+	const unsigned char *commit_data = g->chunk_commit_data + GRAPH_DATA_WIDTH * pos;
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
The comment in the header file commit-graph.h talks about filling
graph_pos and generation members of the given commit, but I don't see
filling graph_pos member here.

We are missing the following line:

+    item->graph_pos = pos;

I will add it for v5. The equivalent line exists in fill_commit_in_graph().


Sidenote: it is a tiny little bit strange to see symbolic constants like
GRAPH_DATA_WIDTH near using magic values such as 8 and 2.

There needs to be some boundary between abstraction and concreteness when dealing directly with a binary file format. GRAPH_DATA_WIDTH helps us navigate to the correct "row" in the chunk, while we use the constants 8 and 2 to get the correct "column" out of that row.
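
The "row" and "column" arithmetic can be sketched as a standalone C snippet. This is an illustration under stated assumptions, not the code from commit-graph.c: the toy_* names are hypothetical, a 20-byte (SHA-1) hash is assumed, and the row layout is assumed to be the tree id, two 4-byte parent slots, then 8 bytes whose top 30 bits hold the generation and whose low 34 bits hold the commit date.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout constants for a SHA-1 repository (illustrative only). */
#define TOY_HASH_LEN 20
#define TOY_DATA_WIDTH (TOY_HASH_LEN + 16) /* tree id + 2 parents + 8 bytes */

/* Read a 32-bit big-endian value, in the spirit of get_be32(). */
static uint32_t toy_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Store a 32-bit value in big-endian order (used only to build test data). */
static void toy_put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/*
 * "Row": skip pos rows of TOY_DATA_WIDTH bytes each.
 * "Column": skip the tree id and the two 4-byte parent slots
 * (hash_len + 8), then shift away the low 2 bits, which are assumed
 * to be the top bits of the commit date.
 */
static uint32_t toy_generation_at(const unsigned char *chunk, uint32_t pos)
{
	const unsigned char *row = chunk + (size_t)TOY_DATA_WIDTH * pos;

	return toy_get_be32(row + TOY_HASH_LEN + 8) >> 2;
}

/* Write generation `gen` into row 1 of a two-row chunk and read it back. */
static uint32_t toy_round_trip(uint32_t gen)
{
	unsigned char chunk[2 * TOY_DATA_WIDTH];

	assert(gen < (1u << 30)); /* the generation field is 30 bits wide */
	memset(chunk, 0, sizeof(chunk));
	/* Set the two low (date) bits as well, to show they are discarded. */
	toy_put_be32(chunk + TOY_DATA_WIDTH + TOY_HASH_LEN + 8, (gen << 2) | 3u);
	return toy_generation_at(chunk, 1);
}
```

The symbolic width keeps the row stride in one place, while the literal 8 and 2 sit next to the single read that depends on them.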


+
  static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t pos)
  {
  	uint32_t edge_value;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, uin
  	return 1;
  }
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)
+{
+	if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+		*pos = item->graph_pos;
+		return 1;
+	} else {
+		return bsearch_graph(g, &(item->object.oid), pos);
+	}
+}
Nice refactoring here.
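
For readers following along, the lookup pattern can be tried in isolation with a small C sketch. It is a toy model, not Git's implementation: object ids are replaced by plain integers, and every toy_* name is hypothetical. A cached position is trusted as-is; otherwise a binary search over the sorted ids supplies it.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NOT_FROM_GRAPH 0xFFFFFFFFu /* stand-in for COMMIT_NOT_FROM_GRAPH */

struct toy_commit {
	uint32_t id;        /* stand-in for the object id */
	uint32_t graph_pos; /* cached position, or TOY_NOT_FROM_GRAPH */
};

struct toy_graph {
	const uint32_t *sorted_ids; /* ids in ascending order */
	uint32_t nr;
};

/* Binary search; returns 1 and sets *pos when id is present, else 0. */
static int toy_bsearch_graph(const struct toy_graph *g, uint32_t id,
			     uint32_t *pos)
{
	uint32_t lo = 0, hi = g->nr;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;

		if (g->sorted_ids[mid] == id) {
			*pos = mid;
			return 1;
		}
		if (g->sorted_ids[mid] < id)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/* Use the cached position when there is one; search otherwise. */
static int toy_find_commit_in_graph(const struct toy_commit *item,
				    const struct toy_graph *g, uint32_t *pos)
{
	assert(item && g && pos);
	if (item->graph_pos != TOY_NOT_FROM_GRAPH) {
		*pos = item->graph_pos;
		return 1;
	}
	return toy_bsearch_graph(g, item->id, pos);
}

/* Look up `id` in the fixed graph {10, 20, 30, 40}; -1 means not found. */
static int toy_lookup(uint32_t id, uint32_t cached_pos)
{
	static const uint32_t ids[] = { 10, 20, 30, 40 };
	struct toy_graph g = { ids, 4 };
	struct toy_commit c = { id, cached_pos };
	uint32_t pos;

	if (!toy_find_commit_in_graph(&c, &g, &pos))
		return -1;
	return (int)pos;
}
```

Note that a cached position short-circuits the search entirely, so in this sketch it is returned even for an id the graph does not contain.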

+
  int parse_commit_in_graph(struct commit *item)
  {
+	uint32_t pos;
+
  	if (!core_commit_graph)
  		return 0;
  	if (item->object.parsed)
  		return 1;
-
  	prepare_commit_graph();
-	if (commit_graph) {
-		uint32_t pos;
-		int found;
-		if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-			pos = item->graph_pos;
-			found = 1;
-		} else {
-			found = bsearch_graph(commit_graph, &(item->object.oid), &pos);
-		}
-
-		if (found)
-			return fill_commit_in_graph(item, commit_graph, pos);
-	}
-
+	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
+		return fill_commit_in_graph(item, commit_graph, pos);
  	return 0;
  }
+void load_commit_graph_info(struct commit *item)
+{
+	uint32_t pos;
+	if (!core_commit_graph)
+		return;
+	prepare_commit_graph();
+	if (commit_graph && find_commit_in_graph(item, commit_graph, &pos))
+		fill_commit_graph_info(item, commit_graph, pos);
+}
Similar functions, different goals (as the names imply).

+
  static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
  {
  	struct object_id oid;
diff --git a/commit.c b/commit.c
index 4d00b0a1d6..39a3749abd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, unsigned long *sizep)
  	return ret;
  }
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size)
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph)
  {
  	const char *tail = buffer;
  	const char *bufptr = buffer;
@@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
  	}
  	item->date = parse_commit_date(bufptr, tail);
+	if (check_graph)
+		load_commit_graph_info(item);
+
All right, read commit-graph specific data after parsing commit itself.
It is at the end because commit object needs to be parsed sequentially,
and it includes more info than is contained in commit-graph CDAT+EDGE
data.

  	return 0;
  }
@@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
  		return error("Object %s not a commit",
  			     oid_to_hex(&item->object.oid));
  	}
-	ret = parse_commit_buffer(item, buffer, size);
+	ret = parse_commit_buffer(item, buffer, size, 0);
The parse_commit_gently() contract is that it provides only the bare
minimum of information, from the commit-graph if possible, and reads the
object from disk and parses it only when that cannot be avoided.  If it
needs to parse it, it doesn't need to fill commit-graph specific data again.

All right.

  	if (save_commit_buffer && !ret) {
  		set_commit_buffer(item, buffer, size);
  		return 0;
diff --git a/commit.h b/commit.h
index 64436ff44e..b5afde1ae9 100644
--- a/commit.h
+++ b/commit.h
@@ -72,7 +72,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
   */
  struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size, int check_graph);
  int parse_commit_gently(struct commit *item, int quiet_on_missing);
  static inline int parse_commit(struct commit *item)
  {
diff --git a/object.c b/object.c
index e6ad3f61f0..efe4871325 100644
--- a/object.c
+++ b/object.c
@@ -207,7 +207,7 @@ struct object *parse_object_buffer(const struct object_id *oid, enum object_type
  	} else if (type == OBJ_COMMIT) {
  		struct commit *commit = lookup_commit(oid);
  		if (commit) {
-			if (parse_commit_buffer(commit, buffer, size))
+			if (parse_commit_buffer(commit, buffer, size, 1))
All that rigamarole was needed because of

DS>                 In some cases, including some branch lookups, the commit
DS> is parsed using parse_object_buffer() which side-steps parse_commit() in
DS> favor of parse_commit_buffer().

Here we want parse_object_buffer() to get also commit-graph specific
data, if available.  All right.

  				return NULL;
  			if (!get_cached_commit_buffer(commit, NULL)) {
  				set_commit_buffer(commit, buffer, size);
diff --git a/sha1_file.c b/sha1_file.c
index 1b94f39c4c..0fd4f0b8b6 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1755,7 +1755,7 @@ static void check_commit(const void *buf, size_t size)
  {
  	struct commit c;
  	memset(&c, 0, sizeof(c));
-	if (parse_commit_buffer(&c, buf, size))
+	if (parse_commit_buffer(&c, buf, size, 0))
For check we don't need commit graph data.  Looks all right.

  		die("corrupt commit");
  }
Best,



