Re: [PATCH] bloom: ignore renames when computing changed paths

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 4/8/2020 6:31 PM, Jeff King wrote:
> On Wed, Apr 08, 2020 at 04:38:27PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
>>
>> The changed-path Bloom filters record an entry in the filter for
>> every path that was changed. This includes every add and delete,
>> regardless of whther a rename was detected. Detecting renames
>> causes significant performance issues, but also will trigger
>> downloading missing blobs in partial clone.
>>
>> The simple fix is to disable rename detection when computing a
>> changed-path Bloom filter.
> 
> Yes, we should be doing as simple a tree-diff as possible.
> 
> I wonder if this might actually be fixing a potential bug, too. The loop
> below this code only looks at the "two" half of each queued diff pair.
> With renames enabled, we might see the source path only in the "one"
> half of the rename pair.
> 
> However, this seems to work already. If I do:
> 
>   echo content >file
>   git add file
>   git commit -m added
>   echo change >file
>   git commit -am changed
>   git mv file other
>   git commit -am 'rename away'
>   git mv other file
>   git commit -am 'rename back'
>   git rm file
>   git commit -am removed
> 
>   git commit-graph write --reachable --changed-paths
>   git log --oneline -- file | cat
> 
> then I see all of the commits. Instrumenting Git like:
> 
> diff --git a/bloom.c b/bloom.c
> index c5b461d1cf..fb2a758e1d 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -207,6 +207,10 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  		for (i = 0; i < diff_queued_diff.nr; i++) {
>  			const char *path = diff_queued_diff.queue[i]->two->path;
>  
> +			warning("queuing touched %c path %s for %s",
> +				diff_queued_diff.queue[i]->status,
> +				path, oid_to_hex(&c->object.oid));
> +
>  			/*
>  			* Add each leading directory of the changed file, i.e. for
>  			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> 
> results in:
> 
>   warning: queuing touched A path file for 2346d88b0cb4bca11c38ee545d007a7a14ca472a
>   warning: queuing touched M path file for 991cd7f0696ae29fea738ca1b8340c90dae4b201
>   warning: queuing touched D path file for d3642c9fb27459ea09f6c967a1e6ad119e265d6f
>   warning: queuing touched A path other for d3642c9fb27459ea09f6c967a1e6ad119e265d6f
>   warning: queuing touched A path file for bc908eb29e562d97ebb8cf718e41b69d3aa1d834
>   warning: queuing touched D path other for bc908eb29e562d97ebb8cf718e41b69d3aa1d834
>   warning: queuing touched D path file for 7433b46bd6aa170ab17a651c10658a5b0c10ba4f
> 
> So we really aren't detecting renames in the first place! And indeed,
> checking diffopt.detect_rename shows that it is unset. So I'm curious if
> there is a case where that would not be true. I _think_ it would only be
> true in a program which ran init_diff_ui_defaults(), but never in
> git-commit-graph.

So our issue was really that the partial clone prefetch logic was just
overly aggressive.

> Even if it does nothing in practice, I'm not at all opposed to having it
> in there as an explicit documentation of our expectations/requirements
> for the loop below. But it's probably worth saying so in the commit
> message.

I will update the message and send a v2. I'll fix the typo that Philip pointed
out, too.

Thanks,
-Stolee




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux