git clean performance issues

Hi,

I'm having a performance issue with "git clean -qxfd" (note, not using
"-ff").

The performance issue shows up when trying to clean untracked
directories that themselves contain many subdirectories. The
run time grows highly nonlinearly with the number of
subdirectories. Some test numbers:

Dirs    Time
10000   0m0.754s
50000   0m16.606s
100000  1m31.418s

When running "git clean -qxffd" (note, using "-ff") the performance is
fast and linear:

Dirs    Time
10000   0m0.158s
50000   0m0.918s
100000  0m1.639s

After checking the source of git-clean my understanding of the problem
is as follows:

When clean.c:cmd_clean finds a directory and the "-d" flag is given, it
calls clean.c:remove_dirs to potentially remove the directory and
all of its subdirectories.

Unless "-ff" is given, remove_dirs tries to be nice and does not remove
directories containing other git repositories. To do this it performs
the following check:

    ...
    if ((force_flag & REMOVE_DIR_KEEP_NESTED_GIT) &&
            !resolve_gitlink_ref(path->buf, "HEAD", submodule_head)) {
    ...

The problem is that refs.c:resolve_gitlink_ref calls
refs.c:get_ref_cache, which linearly searches a linked list of cache
entries and, if it fails to find an existing entry, creates a new
ref_cache entry for the given path and inserts it into the list:

    for (refs = submodule_ref_caches; refs; refs = refs->next)
        if (!strcmp(submodule, refs->name))
            return refs;

    refs = create_ref_cache(submodule);
    refs->next = submodule_ref_caches;
    submodule_ref_caches = refs;
    return refs;

In my scenario get_ref_cache will be called 10000+ times, each time
with a new path. The final few calls will need to search through and
compare 10000+ entries before realizing that there is no existing
entry. This quickly adds up to 100 million+ calls to strcmp().
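
To illustrate the growth, here is a small standalone sketch (not git
code) that mimics the lookup-or-insert pattern quoted above and counts
the strcmp() calls. The names cache_entry, get_entry and
count_comparisons are made up for this example, and the path format is
an assumption; only the search-then-prepend logic is taken from
get_ref_cache.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for git's ref_cache; only the fields needed here. */
struct cache_entry {
    struct cache_entry *next;
    char *name;
};

static struct cache_entry *cache_head;
static unsigned long long strcmp_calls;

/* Same pattern as the quoted code: linear search, prepend on a miss. */
struct cache_entry *get_entry(const char *name)
{
    struct cache_entry *e;

    for (e = cache_head; e; e = e->next) {
        strcmp_calls++;
        if (!strcmp(name, e->name))
            return e;
    }

    e = malloc(sizeof(*e));
    assert(e);
    e->name = malloc(strlen(name) + 1);
    assert(e->name);
    strcpy(e->name, name);
    e->next = cache_head;
    cache_head = e;
    return e;
}

/* Look up n distinct paths in an empty cache; return the strcmp() count. */
unsigned long long count_comparisons(unsigned long n)
{
    char name[64];
    unsigned long i;

    /* Start from an empty cache so repeated calls are independent. */
    while (cache_head) {
        struct cache_entry *next = cache_head->next;
        free(cache_head->name);
        free(cache_head);
        cache_head = next;
    }
    strcmp_calls = 0;

    for (i = 0; i < n; i++) {
        /* Made-up path format; every name is distinct, so every lookup misses. */
        snprintf(name, sizeof(name), "dir/sub%lu", i);
        get_entry(name);
    }
    return strcmp_calls;
}
```

Since every lookup misses, the i-th call compares against all i
entries already in the list, so n distinct paths cost n*(n-1)/2
comparisons in this model. The count is therefore quadratic in the
number of directories.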

From what I can understand, the calls to get_ref_cache in this
scenario will never do any useful work. Is this correct? If so, would
it be possible to bypass it, maybe by calling
resolve_gitlink_ref_recursive directly or by using some other way of
checking for the presence of a git repo in clean.c:remove_dirs?
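
As a rough idea of what "some other way" could look like, here is a
sketch that just checks whether a directory has a ".git" entry, with
no cache involved. This is purely an assumption on my part and not
what git does: has_dot_git is a made-up helper, and the check is
weaker than resolve_gitlink_ref because it does not verify that HEAD
can actually be resolved.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return 1 if "<dir>/.git" exists as a directory
 * or a regular file (a ".git" file can point at a repository stored
 * elsewhere), 0 otherwise. Costs one lstat() call per directory.
 */
int has_dot_git(const char *dir)
{
    char path[4096];
    struct stat st;
    int len;

    len = snprintf(path, sizeof(path), "%s/.git", dir);
    if (len < 0 || (size_t)len >= sizeof(path))
        return 0; /* path too long for this sketch; treat as "not a repo" */

    if (lstat(path, &st) != 0)
        return 0; /* no ".git" entry, or it cannot be inspected */

    return S_ISDIR(st.st_mode) || S_ISREG(st.st_mode);
}
```

The cost per directory is constant here, so the total would stay
linear in the number of subdirectories. Whether such a weaker check is
acceptable for remove_dirs is exactly the open question.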

/Erik