[announce] (nearly) full jgit revision machinery

I have a nearly complete revision machinery implementation for jgit.
The series follows on top of my pack index v2 work, which is in my
master branch of repo.or.cz/egit/spearce.git.

This series started because I found the History view was just too
slow for words.  Running JProbe on Eclipse pointed out that our
existing revision walking code was not up to the challenge of a
real-world sized repository.

Here is a summary.  The series is here:

    git://repo.or.cz/egit/spearce.git revwalk


  Fixes and cleanups
  ------------------
    * Correct unsigned integer conversion issues
    * Include refs/remotes as part of the ref search path
    * Only use bytes 1-5 for a SHA-1 ObjectId hashCode
    * Fix the OBJECT_ID_LENGTH constant at 20
    * Hand unroll the ObjectId.equals method
    * Don't bother caching commits in the delta base cache
    * Teach FileMode how to return the proper instance of itself
    * Teach FileMode about S_IFGITLINK being 0160000
    * Teach FileMode about MISSING being 0

    These have nothing to do with the series directly, but were
    done to improve its code or fix some earlier bugs.
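
    To illustrate the FileMode items, here is a minimal sketch of
    mapping raw mode bits to one shared constant.  ModeSketch and
    fromBits are hypothetical names, not jgit's actual FileMode API.
    The gitlink (0160000) and missing (0) values come from the list
    above; the other values are the usual Git tree entry modes.

```java
/** Hypothetical stand-in for a file mode type; not jgit's FileMode. */
enum ModeSketch {
    MISSING(0),
    TREE(0040000),
    REGULAR_FILE(0100644),
    EXECUTABLE_FILE(0100755),
    SYMLINK(0120000),
    GITLINK(0160000);

    /** Mask selecting the object-type bits of a mode word. */
    static final int TYPE_MASK = 0170000;

    final int bits;

    ModeSketch(int bits) {
        this.bits = bits;
    }

    /** Map raw mode bits to the one shared constant describing them. */
    static ModeSketch fromBits(int raw) {
        switch (raw & TYPE_MASK) {
        case 0:
            if (raw == 0)
                return MISSING;
            break;
        case 0040000:
            return TREE;
        case 0100000:
            // Any execute bit selects the executable variant.
            return (raw & 0111) != 0 ? EXECUTABLE_FILE : REGULAR_FILE;
        case 0120000:
            return SYMLINK;
        case 0160000:
            return GITLINK;
        }
        throw new IllegalArgumentException(
                "unknown mode " + Integer.toOctalString(raw));
    }
}
```

    Because every caller gets the same constant back, modes can be
    compared with == instead of re-decoding the bits each time.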


  Revision Machinery
  ------------------
    * New revision walking library for jgit
    * Teach RevWalk about uninteresting commits
    * Document the flag constants used in RevWalk
    * Implement a commit filtering API for RevWalk
    * Make all RevFilter implementations cloneable
    * Allow RevFilters to break out of the main walk loop
    * Implement a commit time based RevFilter for before/after filtering
    * Refactor basic commit walking from RevWalk to a streamable iterator API
    * Evaluate RevFilters before inserting parent commits to pending
    * Dispose of a RevCommit as soon as it is no longer interesting
    * Use a FIFO queue of RevCommit during reset rather than sorting by date
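
    The filter items above can be pictured roughly as follows.  This
    is only a sketch: Filter, StopWalk, before, after and and are
    hypothetical names, commits are reduced to their commit time, and
    the input is assumed to arrive strictly newest first (no clock
    skew).  Under that assumption an "after" filter can end the walk
    at the first commit older than its cutoff.

```java
import java.util.ArrayList;
import java.util.List;

final class FilterSketch {
    /** Hypothetical signal a filter throws to end the walk early. */
    static final class StopWalk extends RuntimeException {
        private static final long serialVersionUID = 1L;
    }

    /** Hypothetical filter interface, reduced to the commit time. */
    interface Filter {
        boolean include(int commitTime);
    }

    /** Accept commits made at or before the cutoff. */
    static Filter before(int cutoff) {
        return t -> t <= cutoff;
    }

    /** Accept commits made at or after the cutoff; stop once older. */
    static Filter after(int cutoff) {
        return t -> {
            if (t < cutoff)
                throw new StopWalk();
            return true;
        };
    }

    /** Accept only what both filters accept. */
    static Filter and(Filter a, Filter b) {
        return t -> a.include(t) && b.include(t);
    }

    /** Run a filter over commit times that are sorted newest first. */
    static List<Integer> walk(List<Integer> newestFirst, Filter f) {
        List<Integer> out = new ArrayList<>();
        try {
            for (int t : newestFirst)
                if (f.include(t))
                    out.add(t);
        } catch (StopWalk done) {
            // The filter proved nothing further can match.
        }
        return out;
    }
}
```

    For times 50, 40, 30, 20, 10 and and(before(40), after(20)) the
    sketch keeps 40, 30 and 20, and never inspects anything past 10.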

    Aside from --topo-order this is a nearly full revision machinery.
    Features like --author, --committer, --grep, --not, all work.
    We don't parse "A..B" but we can do it as "A --not B", or
    "A ^B ^C D".
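
    A minimal sketch of how "A --not B" and "A ^B ^C D" select
    commits.  It is not the code in this series: Commit, log and the
    refs map are hypothetical, and the sketch takes the simple route
    of first collecting everything reachable from the uninteresting
    tips, then walking the rest newest first by commit time.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.PriorityQueue;
import java.util.Set;

final class RevWalkSketch {
    /** Hypothetical commit node; not jgit's RevCommit. */
    static final class Commit {
        final String name;
        final int time;
        final List<Commit> parents;

        Commit(String name, int time, Commit... parents) {
            this.name = name;
            this.time = time;
            this.parents = Arrays.asList(parents);
        }
    }

    /** List commit names for arguments such as "A", "^B" and "--not". */
    static List<String> log(Map<String, Commit> refs, String... args) {
        List<Commit> starts = new ArrayList<>();
        Deque<Commit> todo = new ArrayDeque<>();
        boolean not = false;
        for (String arg : args) {
            if (arg.equals("--not")) {
                not = !not; // flips the meaning of what follows
                continue;
            }
            boolean negative = arg.startsWith("^");
            String name = negative ? arg.substring(1) : arg;
            Commit c = Objects.requireNonNull(refs.get(name), name);
            if (negative != not)
                todo.add(c);
            else
                starts.add(c);
        }

        // Everything reachable from an uninteresting tip is hidden.
        Set<Commit> hidden = new HashSet<>();
        while (!todo.isEmpty()) {
            Commit c = todo.pop();
            if (hidden.add(c))
                todo.addAll(c.parents);
        }

        // Walk what is left, newest commit time first.
        PriorityQueue<Commit> pending = new PriorityQueue<>(
                (a, b) -> Integer.compare(b.time, a.time));
        Set<Commit> seen = new HashSet<>(hidden);
        for (Commit c : starts)
            if (seen.add(c))
                pending.add(c);
        List<String> out = new ArrayList<>();
        while (!pending.isEmpty()) {
            Commit c = pending.poll();
            out.add(c.name);
            for (Commit p : c.parents)
                if (seen.add(p))
                    pending.add(p);
        }
        return out;
    }
}
```

    Computing the hidden set up front costs a full walk of the
    uninteresting side, so this shape is for illustration only; it
    makes no attempt to stream results early.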

    The package's performance on Java 6/x86/Windows is well within
    spitting distance of Git on Cygwin.  I'm seeing jgit take just
    over 50 ms longer to produce ~13,000 commits.

    I'll get --topo-order done soon, most likely this weekend. It
    will probably also get an incremental restart trick like we do
    in C Git, where we shovel the data out as early as we can but
    signal a restart if topo order was violated.

    Memory can still be reduced a tad, but it's fast as heck.
    Our main suckage for memory is the ObjectId class.  On a modern
    HotSpot JVM it costs 44 bytes to store a 20 byte SHA-1.  I think
    our second main suckage is the ObjectIdMap.  java.util.HashMap is
    not that efficient when it comes to memory usage.


  Tree Diff
  ---------
    * New tree walking library for jgit
    * Implement a tree entry filtering API for TreeWalk
    * Define a path limiter for TreeWalk to reduce results

    Turns out the Tree class is too slow for serious work.  This is
    a concurrent tree walker that can do an N-way difference of
    an arbitrary number of trees, either recursively or shallowly.
    We can also limit the difference to only entries matching a
    particular filter expression.
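
    The idea of a concurrent N-way walk can be sketched like this.
    Entry, diff and the filter parameter are hypothetical, not the
    TreeWalk API from the series.  The sketch is shallow only,
    assumes every tree is already sorted by entry name, and compares
    names with String.compareTo, which only matches Git's byte order
    for plain ASCII names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Predicate;

final class TreeDiffSketch {
    /** Hypothetical tree entry: a name plus the id it points at. */
    static final class Entry {
        final String name;
        final String id;

        Entry(String name, String id) {
            this.name = name;
            this.id = id;
        }
    }

    /**
     * Walk all trees in step and return the names that are not
     * identical in every tree and that the filter accepts.
     */
    static List<String> diff(List<List<Entry>> trees,
            Predicate<String> filter) {
        int n = trees.size();
        int[] pos = new int[n]; // one cursor per tree
        List<String> changed = new ArrayList<>();
        for (;;) {
            // The smallest name under any cursor is the next path.
            String min = null;
            for (int i = 0; i < n; i++) {
                List<Entry> t = trees.get(i);
                if (pos[i] < t.size()) {
                    String name = t.get(pos[i]).name;
                    if (min == null || name.compareTo(min) < 0)
                        min = name;
                }
            }
            if (min == null)
                return changed; // every tree is exhausted

            // Compare ids across trees; advance the cursors on min.
            String firstId = null;
            boolean differs = false;
            for (int i = 0; i < n; i++) {
                List<Entry> t = trees.get(i);
                String id = null; // null means "absent in this tree"
                if (pos[i] < t.size() && t.get(pos[i]).name.equals(min))
                    id = t.get(pos[i]++).id;
                if (i == 0)
                    firstId = id;
                else if (!Objects.equals(firstId, id))
                    differs = true;
            }
            if (differs && filter.test(min))
                changed.add(min);
        }
    }
}
```

    Each entry is visited once per tree, so the cost is linear in the
    total number of entries no matter how many trees are compared.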

    I spent a good amount of time tuning this library, but it
    probably can still benefit from more micro-optimization.  Java is
    just dog slow, but I think we're doing almost the best we can.

    At present this library only does tree objects, but it should
    also be able to walk the local filesystem and an index (or
    two or three) concurrently with trees.  I'll implement those
    other iterators in the near future.


  Path Limiter & Parent Rewrites
  ------------------------------
    * Implement path limited revision walking and parent rewriting
    * Allow revwalk and treewalk to directly access cached object data

    These changes give us functionality like "git log HEAD -- foo.c",
    where the DAG gets subsetted down to only those commits that
    had an interesting change against foo.c.  Yes, it also works
    on directories and combinations of paths.
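
    As a rough illustration of parent rewriting, the sketch below
    handles the simplest case only: a single-parent (linear) history
    and one exact path.  Commit, limit and the per-commit path-to-id
    map are hypothetical; merges need extra rules that this sketch
    ignores.  A commit is kept when its id for the path differs from
    its parent's, and each kept commit's parent is rewritten to the
    nearest kept ancestor.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

final class PathLimitSketch {
    /** Hypothetical commit: a single parent and a flat path-to-id map. */
    static final class Commit {
        final String name;
        final Commit parent; // null for the root commit
        final Map<String, String> tree;

        Commit(String name, Commit parent, Map<String, String> tree) {
            this.name = name;
            this.parent = parent;
            this.tree = tree;
        }
    }

    /**
     * Return the kept commits, newest first, each mapped to its
     * rewritten parent (null when no kept ancestor remains).
     */
    static Map<String, String> limit(Commit tip, String path) {
        Map<String, String> rewritten = new LinkedHashMap<>();
        String child = null; // most recently kept commit
        for (Commit c = tip; c != null; c = c.parent) {
            String mine = c.tree.get(path);
            String theirs = c.parent == null
                    ? null : c.parent.tree.get(path);
            if (Objects.equals(mine, theirs))
                continue; // path untouched here: drop the commit
            if (child != null)
                rewritten.put(child, c.name); // skip dropped commits
            rewritten.put(c.name, null);
            child = c.name;
        }
        return rewritten;
    }
}
```

    With four commits where only the first and third touch foo.c, the
    result keeps the third commit with the first as its parent, which
    is the dense, connected shape a plotter needs.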

    I've done a bunch of manual testing and differencing against
    the C Git output for the same inputs, and jgit is producing
    identical results.  That's with "--parents".  So we can now
    get a dense, plottable subsetted DAG. Yay.

    When the path limiter is enabled we spend ~98% of our CPU time
    inside of the treewalk package introduced above for the tree
    difference feature.  It's horribly slow.  If the path we are
    looking for is early enough in the tree we're not too far off
    from C Git's performance, but try matching on say "t" and the
    world falls over.  We are 4x slower.


  Command Line Programs
  ---------------------
    * Implement some basic command line programs built on top of jgit

    This is a "jgit" command line wrapper, along with basic subcommand
    tools like "log", "rev-list", "ls-tree" and "diff-tree", all
    built on top of the above functionality.  jgit is starting to
    feel like it is actually capable of doing some of the major
    functions C Git does.

    "jgit log | less" has a slight lag before the first commit when
    compared to "git log", but is still faster than "svn log".  ;-)


-- 
Shawn.