Re: Strange O(N^3) behavior in "git filter-branch"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 07/15/2011 11:20 PM, Jeff King wrote:
> On Fri, Jul 15, 2011 at 11:51:49AM -0700, Junio C Hamano wrote:
>> But I think the replace-object codepath should be optimized to realize
>> there is no funky replacement (which _is_ a rare configuration) going on
>> much early so that it does not incur that much overhead you observed. IOW,
>> I tend to agree with your 3. below.
>> [...]
>>> 3. Optimize the specific case where there is no refs/replace
>>> directory--if this directory is missing, then defer populating the loose
>>> refs cache in the hope that it will never be needed.
> 
> It already tries to do so. It looks like it calls:
> 
>   for_each_replace_ref(...)
> 
> which calls:
> 
>   do_for_each_ref("refs/replace", ...)
> 
> which reads _every_ loose ref, regardless of the prefix we have given
> it. So the optimization should go into the for_each_ref code, which
> should avoid looking at parts of the hierarchy that are just going to be
> culled, no?

The way to implement the deferral of reading refs if for_each_ref() is
called on an empty subtree (alternative 4 or 5 from my earlier email) is
to change get_loose_refs() to take a base argument, have it check
whether the directory corresponding to base is empty, and if so return
NULL without reading anything into the cache.

Currently, the loose ref cache is stored as a single linked list, so
there is no easy way to populate part of it now and part of it later.
So with the current data structure, the loose refs cache is
all-or-nothing.  It would be possible to avoid filling it if there are
not replace references, but if there is even one loose replace reference
then the whole refs tree would have to be crawled.  Implementing this
variation is alternative 4 from the early email.

More flexible would be to change the way the loose ref cache is stored
from a linked list into a tree (probably mirroring the directory tree).
 If this were done, then it would be possible to populate the cache
lazily, only crawling the part of the refs tree that is needed for a
particular call of for_each_ref() and reusing any part of the cache that
is already in memory.  This (alternative "5") is considerably more
involved because the data structure has to be changed, but also
potentially a much bigger win because the presence of a single reference
in refs/replace would not force all of the refs/heads and refs/tags and
... to be read.

> And then we would see immediately that there is no refs/replace at all,
> and quit early. And as a bonus, things like "for_each_tag_ref" would get
> faster in a repository with many branches, too.

Only if the data structure used to hold the loose refs cache is changed...

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]