[PATCH v3 3/4] revision: avoid loading object headers multiple times

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



When loading references, we try to optimize loading of commits by using
the commit graph. To do so, we first need to determine whether the
object actually is a commit or not, which is why we always execute
`oid_object_info()` first. Like this, we'll unpack the object header of
each object first.

This pattern can be quite inefficient in case many references point to
the same commit: if the object didn't end up in the cached objects, then
we'll repeatedly unpack the same object header, even if we've already
seen the object before.

Optimize this pattern by using `lookup_unknown_object()` first in order
to determine whether we've seen the object before. If so, then we don't
need to re-parse the header but can directly use its object information
and thus gain a modest performance improvement. Executed in a real-world
repository with around 2.2 million references:

    Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      4.771 s ±  0.238 s    [User: 4.440 s, System: 0.330 s]
      Range (min … max):    4.539 s …  5.219 s    10 runs

    Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      4.454 s ±  0.037 s    [User: 4.122 s, System: 0.332 s]
      Range (min … max):    4.375 s …  4.496 s    10 runs

    Summary
      'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran
        1.07 ± 0.05 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev'

The downside is that `lookup_unknown_object()` is forced to always
allocate an object such that it's big enough to host all object types'
structs and thus we may waste memory here. This tradeoff is probably
worth it though considering the following struct sizes:

    - commit: 72 bytes
    - tree: 56 bytes
    - blob: 40 bytes
    - tag: 64 bytes

Assuming that in almost all repositories, most references will point to
either a tag or a commit, we'd have a modest increase in memory
consumption of about 12.5% here.

Signed-off-by: Patrick Steinhardt <ps@xxxxxx>
---
 revision.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/revision.c b/revision.c
index f06a5d63a3..671b6d6513 100644
--- a/revision.c
+++ b/revision.c
@@ -359,14 +359,22 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 				    const struct object_id *oid,
 				    unsigned int flags)
 {
-	struct object *object;
+	struct object *object = lookup_unknown_object(revs->repo, oid);
+
+	if (object->type == OBJ_NONE) {
+		int type = oid_object_info(revs->repo, oid, NULL);
+		if (type < 0 || !object_as_type(object, type, 1)) {
+			object = NULL;
+			goto out;
+		}
+	}
 
 	/*
 	 * If the repository has commit graphs, repo_parse_commit() avoids
 	 * reading the object buffer, so use it whenever possible.
 	 */
-	if (oid_object_info(revs->repo, oid, NULL) == OBJ_COMMIT) {
-		struct commit *c = lookup_commit(revs->repo, oid);
+	if (object->type == OBJ_COMMIT) {
+		struct commit *c = (struct commit *) object;
 		if (!repo_parse_commit(revs->repo, c))
 			object = (struct object *) c;
 		else
@@ -375,6 +383,7 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 		object = parse_object(revs->repo, oid);
 	}
 
+out:
 	if (!object) {
 		if (revs->ignore_missing)
 			return object;
-- 
2.32.0

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux