Re: rather slow 'git repack' in 'blob:none' partial clones

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Apr 12, 2021 at 07:47:41PM -0400, Jeff King wrote:
> On Mon, Apr 12, 2021 at 11:36:53PM +0200, SZEDER Gábor wrote:
> 
> > All what you wrote above makes sense to me, but I've never looked at
> > how partial clones work, so it doesn't mean anything...  In any case
> > your patch brings great speedups, but, unfortunately, the memory usage
> > remains as high as it was:
> > 
> >   $ /usr/bin/time --format=elapsed: %E  max RSS: %Mk /home/szeder/src/git/bin-wrappers/git -C git-partial.git/ gc
> >   Enumerating objects: 188450, done.
> >   Counting objects: 100% (188450/188450), done.
> >   Delta compression using up to 4 threads
> >   Compressing objects: 100% (66623/66623), done.
> >   Writing objects: 100% (188450/188450), done.
> >   Total 188450 (delta 120109), reused 188450 (delta 120109), pack-reused 0
> >   elapsed: 0:15.18  max RSS: 1888332k
> > 
> > And git.git is not all that large, I wonder how much memory would be
> > necessary to 'gc' a 'blob:none' clone of e.g. chromium?! :)
> > 
> > BTW, this high memory usage in a partial clone is not specific to
> > 'repack', 'fsck' suffers just as much:
> 
> I think the issue is in the exclude-promisor-object code paths of the
> traversal. Try this:
> 
>   [two clones of git.git, one full and one partial]
>   $ git clone --no-local --bare /path/to/git full.git
>   $ git clone --no-local --bare --filter=blob:none /path/to/git partial.git
> 
>   [full clone is quick to traverse all objects]
>   $ time git -C full.git rev-list --count --all
>   63215
>   real	0m0.369s
>   user	0m0.365s
>   sys	0m0.004s
> 
>   [partial is, too; it's the same amount of work because we're just
>    looking at the commits here]
>   $ time git -C partial.git rev-list --count --all
>   63215
>   real	0m0.373s
>   user	0m0.364s
>   sys	0m0.009s
> 
>   [but now ask it to exclude promisor objects, and it's much slower;
>   this is because is_promisor_object() opens up each tree in the pack in
>   order to see which "promised" objects it mentions]

I don't understand this: 'git rev-list --count --all' only counts
commit objects, so why should it open any trees at all?

>   $ time git -C partial.git rev-list --exclude-promisor-objects --count --all
>   0
>   real	0m11.723s
>   user	0m11.354s
>   sys	0m0.369s
> 
> And I think that is the source for the memory use, too. Without that
> option, we peak at 11MB heap. With it, 1.6GB. Oops. Looks like there
> might be some "leaks" when we parse tree objects and then leave the
> buffers connected to the structs (technically not a leak because we
> still have the pointers, but obviously having every tree in memory at
> once is not good).
> 
> The patch below drops the peak heap to 165MB. Still quite a bit more,
> but I think it's a combination of delta-base cache (96MB) plus extra
> structs for all the non-commit objects whose flags we marked.
> 
> It does seem like there's probably a good space/time tradeoff to be made
> here (e.g., caching the list of "promisor" object ids for the pack
> instead of inflating and reading all of the trees on the fly).
> 
> diff --git a/packfile.c b/packfile.c
> index 8668345d93..b79cbc8cd4 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -2247,6 +2247,7 @@ static int add_promisor_object(const struct object_id *oid,
>  			return 0;
>  		while (tree_entry_gently(&desc, &entry))
>  			oidset_insert(set, &entry.oid);
> +		free_tree_buffer(tree);
>  	} else if (obj->type == OBJ_COMMIT) {
>  		struct commit *commit = (struct commit *) obj;
>  		struct commit_list *parents = commit->parents;
> diff --git a/revision.c b/revision.c
> index 553c0faa9b..fac2577748 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -3271,8 +3271,15 @@ static int mark_uninteresting(const struct object_id *oid,
>  			      void *cb)
>  {
>  	struct rev_info *revs = cb;
> +	/*
> +	 * yikes, do we really need to parse here? maybe

Heh, a "yikes" here and a "yuck" in your previous patch...  This issue
was worth reporting :)

> +	 * lookup_unknown_object() would be sufficient, or
> +	 * even oid_object_info() followed by the correct type
> +	 */
>  	struct object *o = parse_object(revs->repo, oid);
>  	o->flags |= UNINTERESTING | SEEN;
> +	if (o->type == OBJ_TREE)
> +		free_tree_buffer((struct tree *)o);
>  	return 0;
>  }
>  
> 
> -Peff



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux