On 10/28/2011 02:28 PM, mhagger@xxxxxxxxxxxx wrote: > [...] > This patch series changes how references are stored in the reference > cache data structures (entirely internal to refs.c). Previously, the > packed refs were stored in one big array of pointers-to-struct, and > the loose refs were stored in another. [...] > > Therefore, this patch series changes the data structure used to store > the reference cache from a single array of pointers-to-struct into a > tree-like structure in which each subdirectory of the reference > namespace is stored as an array of pointers-to-entry and entries can > be either references or subdirectories containing more references. > Moreover, each subdirectory of loose references is only read from disk > when a reference from that subdirectory (or one of its descendants) is > needed. This slightly slows down commands that need to iterate > through all references (simply because the new data structures are > more complicated), but it *dramatically* decreases the time needed for > some common operations. For example, in a test repository with 20000 > revisions and 10000 loose tags: Due to the death of my old computer, I got distracted and never published the benchmark results that I cited above. (The numbers here are different than the originals because they are done on a different computer.) The benchmarks are done using code from [1]; the test repository consists of a linear series of 20000 commits (with very little content) and 10000 tags (on every second commit); the tags are unsharded. The numbers are times in seconds; "cold" means that the disk cache was flushed immediately before the test; "warm" usually means that the same operation was done a second time. Column #0 is Git 1.7.7.2; column #1 is Git 1.7.8-rc2; column #2 is a recent git master plus my ref-related patch series (what Junio calls mh/ref-api-take-2 plus the fix that I posted yesterday). The main change between 1.7.7.2 and 1.7.8 is that the latter stores the reference cache in a sorted array rather than a linked list, meaning that it is possible to use bisection to locate a reference quickly by name rather than having to search linearly through a linked list. This greatly helps some operations when most references are packed and can be read from disk quickly. It doesn't make much difference when most references are loose, because the time required to read the loose references from disk overwhelms the time for iterating through the array in memory. The main improvements between 1.7.8 and the new version are when the old code would have read all of the loose references but the new code only needs to read a couple of directories of them. Given that almost all of the references in the test repository are tags, they often don't need to be read at all when branches are being manipulated. By contrast, many old code paths force *all* of the references to be read, for example when they check for replacements in refs/replace/. I've done other benchmarks with varying numbers of references. The results suggest that many operations go from O(N) in the number of loose references to O(1) (e.g., if all they do is check refs/replace/) or some other slow scaling that depends on how the reference namespace is organized. Anything involving packed references necessarily scales at least like O(N) because packed references are all read at once. OTOH reading packed refs is so much faster than reading loose references that with 10000 references, packing is still advantageous. (For some number of references, of course, the curves must cross and loose references will be more efficient than packed references for some operations.) The case of git-filter-branch is particularly dramatic; the old code scales like O(N^2) whereas the new code scales like O(N) as expected. Moreover, git-filter-branch creates lots of loose references while it works, so it is not possible to evade the problem by packing the references before invoking git-filter-branch. I believe that git-filter-branch is mostly slowed down be each subcommand's check for replacements in refs/replace (even though there are none in my test repository) because in the old code this check forces *all* loose references to be read. Versions 1.7.7 and 1.7.8 of git-filter-branch runs much faster if the --no-replace-objects option is used. With these changes, it becomes thinkable to work with repositories with very large numbers of references (especially if the reference namespace is sharded appropriately), whereas in 1.7.7 some operations were annoyingly slow. Michael [1] branch "refperf" at git://github.com/mhagger/git.git > =================================== ======== ======== ======== > Test name [0] [1] [2] > =================================== ======== ======== ======== > branch-loose-cold 1.59 1.68 0.29 > branch-loose-warm 0.04 0.04 0.00 > for-each-ref-loose-cold 1.86 1.96 1.88 > for-each-ref-loose-warm 0.10 0.10 0.11 > checkout-loose-cold 1.66 1.86 0.39 > checkout-loose-warm 0.04 0.05 0.01 > checkout-orphan-loose 0.04 0.04 0.00 > checkout-from-detached-loose-cold 2.04 2.11 1.81 > checkout-from-detached-loose-warm 0.24 0.15 0.16 > branch-contains-loose-cold 1.79 1.86 1.87 > branch-contains-loose-warm 0.14 0.14 0.14 > pack-refs-loose 0.49 0.53 0.53 > branch-packed-cold 0.28 0.25 0.25 > branch-packed-warm 0.02 0.00 0.00 > for-each-ref-packed-cold 0.34 0.39 0.40 > for-each-ref-packed-warm 0.07 0.07 0.07 > checkout-packed-cold 2.81 0.50 0.55 > checkout-packed-warm 0.01 0.00 0.01 > checkout-orphan-packed 0.00 0.00 0.00 > checkout-from-detached-packed-cold 2.83 0.55 0.46 > checkout-from-detached-packed-warm 2.45 0.12 0.13 > branch-contains-packed-cold 0.38 0.32 0.43 > branch-contains-packed-warm 0.11 0.11 0.11 > clone-loose-cold 30.16 30.31 30.51 > clone-loose-warm 1.28 1.30 1.33 > fetch-nothing-loose 0.21 0.39 0.38 > pack-refs 0.14 0.12 0.14 > fetch-nothing-packed 0.21 0.40 0.39 > clone-packed-cold 1.07 1.24 1.18 > clone-packed-warm 0.23 0.23 0.22 > fetch-everything-cold 30.49 30.89 31.09 > fetch-everything-warm 1.78 2.01 2.06 > filter-branch-warm 2949.81 2891.51 440.60 > =================================== ======== ======== ======== > > > [0] 8d19b44 (tag: v1.7.7.2) Git 1.7.7.2 > Test repository created using: t/make-refperf-repo --commits 20000 --refs 10000 > [1] bc1bbe0 (tag: v1.7.8-rc2) Git 1.7.8-rc2 > Test repository created using: t/make-refperf-repo --commits 20000 --refs 10000 > [2] 01494b4 (github/ref-api-D) repack_without_ref(): call clear_packed_ref_cache() > Test repository created using: t/make-refperf-repo --commits 20000 --refs 10000 -- Michael Haggerty mhagger@xxxxxxxxxxxx http://softwareswirl.blogspot.com/ -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html