Re: rather slow 'git repack' in 'blob:none' partial clones

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Apr 05, 2021 at 03:02:33AM +0200, Rafael Silva wrote:

> When I was digging into the code and adding trace2_region_*() calls, I
> notice most of the time spent on the `git gc` (for the reported
> situation) was in:
> 
>        # In builtin/pack-objects.c
>        static void get_object_list(int ac, const char **av)
>        {
>                ...
>                if (unpack_unreachable)
>                        loosen_unused_packed_objects();
>                ...
>        }

Yeah, good find.

This is my first time looking at the repacking strategy for partial
clones. It looks like we run an initial pack-objects to cover all the
promisor objects, and then do the "real" repack for everything else,
with "--exclude-promisor-objects".

The purpose of loosen_unused_packed_objects() is to catch any objects
that will be lost when our caller deletes all of the packs. But in this
case, those promisor objects are in a pack which won't be deleted, so
they should not be included.

> Another interesting thing is, the loosen_unused_packed_objects()
> function is being called twice because the function loads all packs
> files, via get_all_packs(), which will return the .temp-*pack file that
> is created by the `git pack-objects` child process from `git gc`:
> 
>     git pack-objects ... --delta-base-offset objects/pack/.tmp-82853-pack ...

Yes, this is the "promisor" pack created by git-repack. It seems like
git-repack should tell pack-objects about the new pack with --keep-pack,
so that we know it is not going to be deleted.

That would also solve the rest of the problem, I _think_. In your
suggestion here:

> I'm not entirely sure about this (not this late in the day), but it seems to
> me that we should simply skip the "missing" (promisor) files when
> operating on a partial clone.
> 
> Perhaps something like:
> 
> --- >8 ---
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 525c2d8552..fedf58323d 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3468,6 +3468,8 @@ static int loosened_object_can_be_discarded(const struct object_id *oid,
>  {
>         if (!unpack_unreachable_expiration)
>                 return 0;
> +       if (exclude_promisor_objects && is_promisor_object(oid))
> +               return 1;
>         if (mtime > unpack_unreachable_expiration)
>                 return 0;
>         if (oid_array_lookup(&recent_objects, oid) >= 0)
> --- >8 ---

you are avoiding writing out the file. But we should realize much
earlier that it is not something we need to even consider loosening.

In the loop in loosen_unused_packed_objects(), we skip packs that are
marked as "keep", so we'd skip the new promisor pack entirely. But we'd
still see all these objects in the _old_ promisor pack. However, for
each object there, we call has_sha1_pack_kept_or_nonlocal(), so that
would likewise realize that each object is already being kept in the
other pack.

Something like this seems to work, but I only lightly tested it, and it
could probably use some refactoring to make it less horrible:

diff --git a/builtin/repack.c b/builtin/repack.c
index fdee8e4578..457525953a 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -574,6 +574,23 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 		repack_promisor_objects(&po_args, &names);
 
 		if (existing_packs.nr && delete_redundant) {
+			/*
+			 * tell pack-objects about our new promisor pack, which
+			 * we will also be keeping
+			 */
+			for_each_string_list_item(item, &names) {
+				/*
+				 * yuck, we seem to only have the name with the
+				 * packdir prefixed
+				 */
+				const char *prefix;
+				if (!skip_prefix(packtmp, packdir, &prefix) ||
+				    *prefix++ != '/')
+					BUG("confused by packtmp");
+				strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
+					     prefix, item->string);
+			}
+
 			if (unpack_unreachable) {
 				strvec_pushf(&cmd.args,
 					     "--unpack-unreachable=%s",

Do you want to try to work with that?

> A quick benchmark did show some promising result:
> 
>     # built from: 2e36527f23 (The sixth batch, 2021-04-02)
>     Benchmark #1: ./bin-wrappers/git -C git.git gc
>           Time (mean ± σ):     135.669 s ±  0.665 s    [User: 42.789 s, System: 91.332 s]
>           Range (min … max):   134.905 s … 136.115 s    3 runs
> 
>     # built from: 2e36527f23 + minor patch (from above)
>     Benchmark #2: ./bin-wrappers/git -C git.git gc
>           Time (mean ± σ):     12.586 s ±  0.031 s    [User: 11.462 s, System: 1.365 s]
>           Range (min … max):   12.553 s … 12.616 s    3 runs
> 
>     Summary:
>           'Benchmark #2' ran 10.78 ± 0.06 times faster than 'Benchmark #1'

It's still quite a bit slower than a non-partial clone because the
traversal with --exclude-promisor-objects is slow. I think that's
because it has to open up all of the objects in the promisor pack to see
what they refer to. I don't know if we can do better (and it's largely
an orthogonal problem to what you're solving here, so it probably makes
sense to just punt on it for now).

-Peff



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux