On performance characteristics of pickaxe with -M, -C, and -C -C

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I've been brewing parts of git-pickaxe in "pu" for the last week
to fix both correctness and performance issues, and it appears
it is now in a good enough shape to even replace git-blame for
everyday use, so I've placed my recent work in "next".

 - Under -M or -C, it tried to find lines that are copied
   around, but it did its scan only once.  When a scan result
   splits the blame for a block of lines into three blocks:

     original span of blame <----------------------------->
     lines matched by -M/-C            <----->     

    ==>

     result of split        <---------><.....><----------->

   the middle part that matched passed the blame to the parent,
   but the new two parts created by this split was not
   re-scanned by the old code and the blame stayed with the
   commit we were inspecting.  The updated code re-scans for
   these two new parts and allows them to pass blame to their
   parents.

 - The implementation of -C -C (aka "find copies harder") was
   very broken.  Under this flag, the program should try all the
   paths in the parent as the candidates of copies for a path
   that is new (that is, not a modification from the original
   path, nor rename/copy from an existing path).  However, it
   used diffcore incorrectly and used only the paths that were
   changed from parent to child (i.e. the same as a single -C).
   The updated code works as advertised.  This is very expensive
   but the option exists in case when you know you want to pay
   the price to find out all the possible cuts-and-pastes.

 - The old code was reading each blob typically twice -- once as
   the "parent's blob" when a commit is compared with its
   parent, and then as the "child's blob" when that parent was
   trying to pass the blame to its parent.  The updated code
   reuses these blobs between the parent and the child and is
   much more efficient.

The following is by no means a scientific test, but here are two
comparisons using blame and pickaxe with or without -M/-C/-C -C.
One is our own revision.c, and the other is kernel/sched.c from
the kernel project.  The number of minor faults in 'time' output
gives a rough indication of the memory footprint of the process.

----------------------------------------------------------------

* git-blame revision.c
0.76user 0.01system 0:00.77elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+3269minor)pagefaults 0swaps
reads blob 82 times

* git-pickaxe revision.c
0.78user 0.00system 0:00.78elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2500minor)pagefaults 0swaps
reads blob 92 times

    These two give comparable results.  The only difference is
    that git-blame says line 319 is from a41e109c (2006-03-12)
    while git-pickaxe says it is from 8efdc326 (2006-03-10).
    Manual inspection of both tells me that they are both valid
    and reasonable.

* git-pickaxe -M revision.c
0.95user 0.02system 0:00.97elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2410minor)pagefaults 0swaps
reads blob 92 times

    This notices that ll. 528-531, 695-701 and 791-794 are not
    new lines introduced by the commits git-blame and
    git-pickaxe without -M place blame on, but the commits just
    shuffled existing lines.  For an example, look at output
    from:

	git show 53069686 -- revision.c

    The commit moved around the code to parse --min-age= option;
    git-pickaxe -M does not blame 53069686 for it, but git-blame
    and git-pickaxe without -M do.

    Because this flag does not cause the command to look for
    copied lines across files (other than the usual rename
    detection), the set of blobs it works on is the same as the
    command without -M.

* git-pickaxe -C revision.c
2.52user 0.03system 0:02.55elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+5262minor)pagefaults 0swaps
reads blob 486 times

    This starts noticing that a lot of the code actually came
    from rev-list.c as mentioned in an earlier message by Linus.

	Message-ID: <Pine.LNX.4.64.0610201630000.3962@xxxxxxxxxxx>

    To find new suspects that are not the same path (or renamed
    path) in the parent, the program inspects files that were
    changed from its parent and that is how it finds that many
    lines came from rev-list.c.  Because it needs to work on
    more blobs than the above cases, it is a heavier weight
    operation.

* git-pickaxe -C -C revision.c
10.60user 0.06system 0:10.66elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+19762minor)pagefaults 0swaps
reads blob 1839 times

    This does not notice anything new compared to the above for
    this file's history.

----------------------------------------------------------------

* git-blame kernel/sched.c
4.68user 0.63system 0:05.31elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+276957minor)pagefaults 0swaps
reads blob 177 times

* git-pickaxe kernel/sched.c
5.63user 0.66system 0:06.29elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+276274minor)pagefaults 0swaps
reads blob 181 times

    The above two produce identical results.

* git-pickaxe -M kernel/sched.c
9.62user 1.35system 0:10.96elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+388517minor)pagefaults 0swaps
reads blob 181 times

    The story is the same as in our earlier example on revision.c.
    For example, look for comments about "task_timeslice()" in:

            git show 91fcdd4e -- kernel/sched.c

* git-pickaxe -C kernel/sched.c
12.94user 1.08system 0:14.04elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+407624minor)pagefaults 0swaps
reads blob 667 times

    This notices that many lines came from arch/ia64/kernel/domain.c

* git-pickaxe -C -C kernel/sched.c
13.09user 1.02system 0:14.11elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+407626minor)pagefaults 0swaps
reads blob 667 times

    For the kernel history (since 2.6.12-rc2) kernel/sched.c is not
    a new file, so there is no difference in the output or
    performance between -C and -C -C.

-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]