[PATCH 00/28] Store references hierarchically in cache

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



From: Michael Haggerty <mhagger@xxxxxxxxxxxx>

This patch series applies on top of v3 of "Use refs API more
consistently", which ultimately branches from gitster/master.

This patch series changes how references are stored in the reference
cache data structures (entirely internal to refs.c).  Previously, the
packed refs were stored in one big array of pointers-to-struct, and
the loose refs were stored in another.  This had a few problems:

* Whenever any loose refs were needed, the whole refs tree had to be
  read from disk into memory.  This can be quite expensive when there
  are a lot of loose references.  And it is often overkill, as
  frequently only a small part of the refs tree is needed.  For
  example:

  * refs/replace is almost *always* needed even though it often
    doesn't even exist.  Thus the presence of many loose references
    slows down *many* git commands for no reason whatsoever.

  * When a new reference is created, is_refname_available() is called
    to see whether there is another another reference whose name
    conflicts with the new one.  Currently this loads and iterates
    through *all* references.  But there are only a few refnames that
    can possibly conflict; for example, given the refname
    "refs/heads/foo/bar", the only possible conflicts are with
    "refs/heads/foo" and "refs/heads/foo/bar/*".  Therefore it is
    silly to load and iterate through the whole refname hierarchy.

  * "git for-each-ref" is capable of searching a subtree of the
    references.  But currently this causes all references to be
    loaded.

Therefore, this patch series changes the data structure used to store
the reference cache from a single array of pointers-to-struct into a
tree-like structure in which each subdirectory of the reference
namespace is stored as an array of pointers-to-entry and entries can
be either references or subdirectories containing more references.
Moreover, each subdirectory of loose references is only read from disk
when a reference from that subdirectory (or one of its descendants) is
needed.  This slightly slows down commands that need to iterate
through all references (simply because the new data structures are
more complicated), but it *dramatically* decreases the time needed for
some common operations.  For example, in a test repository with 20000
revisions and 10000 loose tags:

* the time to create a new branch goes from 180 ms to less than 10 ms
  (my test resolution only includes two decimal places) and the time
  to checkout a new branch does the same.

* the time for a "git filter-branch" of all commits (which used to
  scale like N^2) goes from 4 hours to 13 minutes.  (Since
  filter-branch necessarily *creates* lots of loose references, the
  savings are also there if the references are originally packed.)

The efficiency gains are such that some operations are now faster with
loose references than with packed references; however, some operations
with packed references slow down a bit.

These changes do not increase the amount of space per reference needed
for the reference cache, but they do add one similarly-sized entry for
each subdirectory (for each of loose and packed).  I don't think that
the space increase should be significant in any reasonable situation.

After these changes, there is a benefit to sharding the reference
namespace, especially for loose references.


The patches:

The first few patches are just preparation.

Patch 6 "refs: store references hierarchically" introduces the big
data structure change.  It causes much extra sorting to be done and
therefore slows down performance significantly, but the following two
commits restore performance to approximately the old level.

Patches 11-24 change most of the internal functions to work with
"struct ref_entry *" (namely the kind of ref_entry that holds a
directory of references) instead of "struct ref_dir *".  The reason
for this change it to allow these functions access to the "flag" and
"name" fields that are stored in ref_entry and thereby avoid having to
store redundant information in "struct ref_dir" (which would increase
the size of *every* ref_entry because of its presence in the union).

The other patches (9-10 and 25-28) reap the benefits of the new data
structure:

* do_for_each_ref() can iterate only over the desired subtree instead
  of iterating over all references and filtering out the unwanted ones

* loose references can be read on demand, one directory at a time

* is_refname_available() can query only the possibly-conflicting parts
  of the reference namespace

Michael Haggerty (28):
  refs.c: reorder definitions more logically
  free_ref_entry(): new function
  check_refname_component(): return 0 for zero-length components
  struct ref_entry: nest the value part in a union
  refs.c: rename ref_array -> ref_dir
  refs: store references hierarchically
  sort_ref_dir(): do not sort if already sorted
  refs: sort ref_dirs lazily
  do_for_each_ref(): only iterate over the subtree that was requested
  get_ref_dir(): keep track of the current ref_dir
  refs: wrap top-level ref_dirs in ref_entries
  get_packed_refs(): return (ref_entry *) instead of (ref_dir *)
  get_loose_refs(): return (ref_entry *) instead of (ref_dir *)
  is_refname_available(): take (ref_entry *) instead of (ref_dir *)
  find_ref(): take (ref_entry *) instead of (ref_dir *)
  read_packed_refs(): take (ref_entry *) instead of (ref_dir *)
  add_ref(): take (ref_entry *) instead of (ref_dir *)
  find_containing_direntry(): use (ref_entry *) instead of (ref_dir *)
  search_ref_dir(): take (ref_entry *) instead of (ref_dir *)
  add_entry(): take (ref_entry *) instead of (ref_dir *)
  do_for_each_ref_in_dir*(): take (ref_entry *) instead of (ref_dir *)
  sort_ref_dir(): take (ref_entry *) instead of (ref_dir *)
  struct ref_dir: store a reference to the enclosing ref_cache
  read_loose_refs(): take a (ref_entry *) as argument
  refs: read loose references lazily
  is_refname_available(): query only possibly-conflicting references
  read_packed_refs(): keep track of the directory being worked in
  repack_without_ref(): call clear_packed_ref_cache()

 refs.c | 1268 ++++++++++++++++++++++++++++++++++++++++++---------------------
 refs.h |    7 +-
 2 files changed, 850 insertions(+), 425 deletions(-)

-- 
1.7.7

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]