Re: Strange O(N^3) behavior in "git filter-branch"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 07/14/2011 11:24 AM, Michael Haggerty wrote:
> On 07/14/2011 09:16 AM, Michael Haggerty wrote:
>> I have noticed that "git filter-branch" gets pathologically slow when it
>> operates on a repository that has many references in a complicated
>> directory hierarchy.  The time seems to go like O(N^3), where N is the
>> number of references being rewritten.
> 
> New information: Once I have "primed" a git repository using the
> attached script, a command like the following takes about 90 ms:
> 
>     git rev-parse refs/heads/a4/b1/c0414^0
> 
> strace reveals that it is calling stat64() then lstat64() on every
> directory and every file under .git/refs.

It looks like the time is being taken up by attempted object
replacement.  The --no-replace-objects option makes the lookup fast again:

$ time git rev-parse
refs/heads/a4/b1/c0414^0c9d42cb64d465f354c1fa6eea5825ebb6d0483a7

real	0m0.093s
user	0m0.064s
sys	0m0.032s
$ time git --no-replace-objects rev-parse refs/heads/a4/b1/c0414^0
c9d42cb64d465f354c1fa6eea5825ebb6d0483a7

real	0m0.007s
user	0m0.000s
sys	0m0.012s

The guilty code path is here:

> #0  get_ref_dir (submodule=0x0, base=0x81384dd "refs", list=0x0) at refs.c:260
> #1  0x080f8a63 in get_loose_refs (submodule=0x0) at refs.c:368
> #2  0x080f912f in do_for_each_ref (submodule=0x0, base=<value optimized out>, fn=<value optimized out>, trim=13, flags=0, cb_data=0x0) at refs.c:663
> #3  0x080f92db in for_each_replace_ref (fn=0x80fd1c0 <register_replace_ref>, cb_data=0x0) at refs.c:782
> #4  0x080fd174 in prepare_replace_object (sha1=0xbfffde68 "\311\324,\266MF_5L\037\246^\273m\004\203\247") at replace_object.c:86
> #5  do_lookup_replace_object (sha1=0xbfffde68 "\311\324,\266MF_5L\037\246^\273m\004\203\247") at replace_object.c:103
> #6  0x080e7920 in lookup_replace_object (sha1=0xbfffde68 "\311\324,\266MF_5L\037\246^\273m\004\203\247") at cache.h:774
> #7  parse_object (sha1=0xbfffde68 "\311\324,\266MF_5L\037\246^\273m\004\203\247") at object.c:191
> #8  0x080bcf2a in lookup_commit_reference_gently (sha1=0xbfffde68 "\311\324,\266MF_5L\037\246^\273m\004\203\247", quiet=0) at commit.c:30
> #9  0x080bcf89 in lookup_commit_reference (sha1=0xbfffde68 "\311\324,\266MF_5L\037\246^\273m\004\203\247") at commit.c:39
> #10 0x0810e2a8 in get_parent (name=<value optimized out>, len=<value optimized out>, sha1=<value optimized out>) at sha1_name.c:456
> #11 get_sha1_1 (name=<value optimized out>, len=<value optimized out>, sha1=<value optimized out>) at sha1_name.c:663
> #12 0x0810e9dd in get_sha1_with_context_1 (name=0xbffff48c "refs/heads/a4/b1/c0414^0", sha1=0xbffff098 "\310\360\377\277\001C\021\b\300\265䷂\364\377\277", oc=0xbfffefec, only_to_die=0, prefix=0x0) at sha1_name.c:1123
> #13 0x0810f290 in get_sha1_with_context (name=0xbffff48c "refs/heads/a4/b1/c0414^0", sha1=0xbffff098 "\310\360\377\277\001C\021\b\300\265䷂\364\377\277") at cache.h:824
> #14 get_sha1 (name=0xbffff48c "refs/heads/a4/b1/c0414^0", sha1=0xbffff098 "\310\360\377\277\001C\021\b\300\265䷂\364\377\277") at sha1_name.c:987
> #15 0x080a258a in cmd_rev_parse (argc=2, argv=0xbffff2a4, prefix=0x0) at builtin/rev-parse.c:715
> #16 0x0804b917 in run_builtin (argc=<value optimized out>, argv=<value optimized out>) at git.c:302
> #17 handle_internal_command (argc=<value optimized out>, argv=<value optimized out>) at git.c:460
> #18 0x0804c00a in main (argc=2, argv=0xbffff2a4) at git.c:545

The replace_object machinery is causing the whole loose reference cache
to be populated even though it is only interested in references under
refs/replace (which, in this case, doesn't even exist).

A many possible improvements come to mind, in increasing order of
intrusiveness and generality:

0. Individual users can use --no-replace-objects or
GIT_NO_REPLACE_OBJECTS to reduce their pain, especially when running
git-filter-branch.  But this is cumbersome and error-prone.

1. Change git-filter-branch (and any other long-running commands?) to do
an initial check for the presence of replace references (packed or
loose), and if there are none, set GIT_NO_REPLACE_OBJECTS automatically.
 This would of course fail if any of the user's scripts try to set up
replace references.  (Side note: perhaps the git-replace command should
complain if GIT_NO_REPLACE_OBJECTS is turned on?  It would almost always
indicate a mistake.)  It also wouldn't help in repositories that *have*
replace references.

2. Add an option to git-filter-branch to have it pack references
occasionally.  This could even be made the default behavior if nobody
objects to having their references packed without their conscious decision.

3. Optimize the specific case where there is no refs/replace
directory--if this directory is missing, then defer populating the loose
refs cache in the hope that it will never be needed.

4. Optimize reading of loose refs in general--change get_loose_refs() to
take a "base" parameter, and defer populating the loose refs cache if
the directory corresponding to "base" is absent.  This would salvage the
worst cases, but would always force the whole loose refs cache to be
read even in cases when only part of it is needed.

5. Organize the loose refs cache in memory as a tree, and only populate
the parts of it that are accessed.  This should also speed up iteration
through a subtree by avoiding a linear search through all loose references.

Unfortunately, none of these improvements would change the fact that the
packed refs always have to be read in full, and thus that *basically
all* git commands scale like O(total number of references).  But packed
refs are so much faster than loose refs that this is probably acceptable.

I'd be willing to work on this problem if I get a little bit of feedback
from more experienced git developers about which approach would be
acceptable / preferred.

Michael

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx
http://softwareswirl.blogspot.com/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]