[PATCH 1/7] revision-cache: define revision cache as simple list of revisions

The very first thing that we want out of the revision cache is to be
able to go from a commit to all of its referred objects.  Define a
revision cache that includes just that.

Signed-off-by: Sam Vilain <sam@xxxxxxxxxx>
---
 Documentation/technical/revision-cache.txt |  260 ++++++++++++++++++++++++++++
 1 files changed, 260 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/technical/revision-cache.txt

diff --git a/Documentation/technical/revision-cache.txt b/Documentation/technical/revision-cache.txt
new file mode 100644
index 0000000..cc18535
--- /dev/null
+++ b/Documentation/technical/revision-cache.txt
@@ -0,0 +1,260 @@
+Revision Cache Format
+=====================
+
+The revision cache is an on-disk format which allows for certain graph
+reachability operations to be answered quickly.
+
+A revision cache contains:
+
+  - A 'start' object (i.e. the 'interesting' object ID)
+
+  - A list of objects which are referred to by the 'start' object
+
+    * position when sorted in --date-order
+
+    * object ID
+
+
+Start Object
+------------
+
+The 'start' object is the identifying key of the revision cache.
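+
+As an illustration, this implies a file that begins with something
+like the following header (a sketch only; the field names and widths
+here are assumptions, not part of this patch):
+
+  struct rev_cache_header {
+          unsigned char start_oid[20]; /* the 'start' object's SHA-1 */
+          uint32_t nr_entries;         /* records in the list below */
+  };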
+
+
+Topological contents list
+-------------------------
+
+This list has fixed-length records, so the topological position in
+the list does not need to be stored in each record - it is implicit
+from the record's offset.
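+
+For example, with the header sketched above and 20-byte object names
+(again an illustrative assumption):
+
+  struct rev_cache_entry {
+          unsigned char oid[20]; /* no position field on disk */
+  };
+
+  off_t entry_offset(uint32_t pos)
+  {
+          /* the Nth record holds the object at --date-order
+           * position N, so the position is implicit */
+          return sizeof(struct rev_cache_header)
+                  + (off_t)pos * sizeof(struct rev_cache_entry);
+  }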
+
+--date-order is used as it is the strictest sort order available, but
+this still only specifies an ordering for commit objects.  Other
+objects will appear after the object which first refers to them.  Tag
+objects are sorted as if they were commit objects with a single
+parent, the object they tag.
+
+
+Use Cases
+---------
+This section explores the key functions and operations that this
+index is designed to answer.  For each, efficiency is considered in
+terms of the work that must be carried out to calculate the answer.
+
+
+Determining Cache Suitability - rev_cache_suitable()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is an auxiliary function used by the other use cases when they
+need to decide whether a given revision cache can answer a query.
+
+The function is:
+
+  rev_cache_suitable( rev_cache, object )
+
+The check is simple and fast - it just compares the object to the
+'start' object in the revision cache.
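+
+A minimal sketch of that comparison, assuming the header layout
+sketched earlier:
+
+  #include <string.h>
+
+  int rev_cache_suitable(const struct rev_cache_header *cache,
+                         const unsigned char *oid)
+  {
+          /* suitable iff the queried object is this cache's 'start' */
+          return !memcmp(cache->start_oid, oid, 20);
+  }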
+
+
+Revision Walker
+~~~~~~~~~~~~~~~
+
+The revision walker is the main user of this cache; there is the
+generic function of revision walking, as well as commands that want
+specific information which they normally derive from the revision
+walker output.
+
+
+Setting up the walker
+^^^^^^^^^^^^^^^^^^^^^
+
+The functions for this are (intentionally resembling the current
+revision walker API):
+
+  rev_cache_init()
+  rev_cache_add( interesting?, oid )
+
+As well as this setup, it is necessary to specify which options are
+required.  An API exists to do this; the return value from
+rev_cache_ok() is true if suitable caches were found for the
+information required, and false if a graph traversal would be needed:
+
+  rev_cache_options( ordered? )
+  rev_cache_ok( ) : Bool
+
+The rev_cache_ok() function must open all available revision caches
+and see whether any cache's 'start' object matches the single
+'interesting' object passed to rev_cache_add().  If one matches, it
+returns true.  If multiple 'interesting' objects were specified and
+'ordered' is true, then the function returns false.
+
+If the ordering flag was set to false, then all of the 'interesting'
+objects must be found in separate revision caches.
+
+If any 'uninteresting' objects were passed, then the return value is
+always false.
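+
+Taken together, the decision rules sketch out as follows (the
+variable and helper names here are assumptions):
+
+  int rev_cache_ok(void)
+  {
+          int i;
+
+          if (nr_uninteresting)
+                  return 0; /* negated objects are never cacheable */
+          if (ordered && nr_interesting > 1)
+                  return 0; /* cannot merge ordering across caches */
+          for (i = 0; i < nr_interesting; i++)
+                  if (!find_cache_for(interesting[i]))
+                          return 0; /* each tip needs its own cache */
+          return 1;
+  }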
+
+
+Returning objects
+^^^^^^^^^^^^^^^^^
+
+The 'rev_cache_fetch()' iterator returns entries from the topological
+list.
+
+  rev_cache_fetch() : oid
+
+If returning objects from a single revision cache, it opens the
+cache and returns each object ID from the topological list in turn.
+
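+A sketch of the iteration, assuming the cache's record list has been
+mapped into an in-core array as above (names illustrative):
+
+  const unsigned char *rev_cache_fetch(struct rev_cache *cache)
+  {
+          if (cache->next >= cache->nr_entries)
+                  return NULL; /* list exhausted */
+          return cache->entries[cache->next++].oid;
+  }
+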
+
+Accelerating in-progress walking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is possible that during a revision walk, a revision is discovered
+which may mean the rest of the walk can be completed faster - for
+example, one that has its own revision cache.
+
+In practice, this is likely to be implemented by making in-core cache
+entries for objects with revision caches prior to walking; then, when
+such an object is encountered, the special action can be taken.
+
+
+Creating Revision Caches
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Once the revision cache is set up via rev_cache_init() and an
+'interesting' object added, then call:
+
+  rev_cache_create( )
+
+This function will not work if multiple 'interesting' objects or any
+'uninteresting' objects were passed.
+
+This function must revision walk the commit graph, sorting in
+--date-order along the way, and may emit revisions to the topological
+object list as they are discovered.
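+
+In outline (a sketch; the traversal helpers named here are
+assumptions standing in for the real revision walker):
+
+  void rev_cache_create(void)
+  {
+          unsigned char oid[20];
+
+          /* walk from the single 'interesting' object in
+           * --date-order, appending a record for each commit and
+           * each object it refers to, as they are shown */
+          while (walk_next(oid))
+                  append_entry(cache_file, oid);
+  }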
+
+
+receive-pack/pack-objects
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+pack-objects, when called from receive-pack, is a special user of the
+revision cache; it has the extra requirements of wanting to know:
+
+* how many objects are between 'interesting' and 'uninteresting'
+  objects, at the beginning of the run, to emit the pack header
+
+* for checking re-usability of deltas, whether the delta base object
+  in the pack is in the received set of objects
+
+* for 'thin' pack generation, whether the delta base object is in the
+  received set of objects, -or- reachable from any 'uninteresting'
+  objects
+
+* for 'shallow' clone, whether the delta base object is reachable
+  without passing any of the 'uninteresting' objects
+
+The aim is for 'pack-objects' to be able to start returning objects
+immediately in the case where a suitable revision cache is found,
+without waiting for revision counting or repacking.
+
+
+Returning object counts
+^^^^^^^^^^^^^^^^^^^^^^^
+
+If the revision cache was suitable, e.g. for a complete fetch, then
+the number of objects can be obtained simply from the size of the
+object list.
+
+If multiple revision caches were used for multiple 'interesting'
+objects, then the number of objects can be obtained by building a hash
+table of all the objects in all of the caches, and returning the
+number of hash table entries.
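+
+Sketched with an object-ID set (the set API here is an assumption):
+
+  unsigned long count_objects(struct rev_cache **caches, int nr)
+  {
+          struct oid_set seen = OID_SET_INIT;
+          uint32_t j;
+          int i;
+
+          for (i = 0; i < nr; i++)
+                  for (j = 0; j < caches[i]->nr_entries; j++)
+                          oid_set_insert(&seen,
+                                         caches[i]->entries[j].oid);
+          return seen.nr; /* each object counted once */
+  }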
+
+If 'uninteresting' objects were passed, no cache can be suitable.
+
+
+Re-using deltas
+^^^^^^^^^^^^^^^
+
+This applies to the 'Compressing Objects' phase, previously known as
+the 'Deltafying Objects' phase.  Instead of searching for deltas, if
+we can be sure that the existing delta can be resolved on the remote,
+then we can re-use it.
+
+For the initial clone, this operation is simple - as objects are
+emitted, the delta from the packfile is re-used.  If a loop is
+detected or the delta base is not in the returned set of objects, then
+the delta is first resolved.
+
+This implies that the list of objects is first loaded into a hash
+table prior to returning any objects; however, this is probably
+acceptable as the entire list is in one stream and will load quickly.
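+
+The re-use test then reduces to a membership check (sketch; the set
+API is assumed as above):
+
+  int can_reuse_delta(const struct oid_set *emitted,
+                      const unsigned char *base_oid)
+  {
+          /* the base must itself be sent, or the remote cannot
+           * resolve the delta */
+          return oid_set_contains(emitted, base_oid);
+  }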
+
+For later fetches, the revision cache is not appropriate as they will
+have 'uninteresting' objects set.
+
+
+Re-using deltas - 'thin' pack
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This case is much like the normal delta re-use case, except that the
+complete set of objects which the remote has claimed to have must
+also be considered.
+
+The cache is not yet sophisticated enough to assist with this use
+case.
+
+
+'shallow' clone considerations
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For enumerating objects, the list of objects between the
+'uninteresting' and the 'shallow' commits must first be enumerated,
+and then subtracted from the list of objects between the
+'interesting' and the 'uninteresting' objects.
+
+For re-using deltas for 'thin' packs, the list of objects between
+'uninteresting' and 'shallow' commits is enumerated and marked as
+valid for delta bases.
+
+The cache is not yet sophisticated enough to assist with this case.
+
+
+creating bundles
+~~~~~~~~~~~~~~~~
+
+So long as a bundle has no 'uninteresting' commits, the revision
+cache is completely appropriate; with the length of the object list
+it can write a header and continue with the 'pack-objects' use case.
+
+
+slicing bundles (mirror-sync)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For 'slicing' bundles, a deterministic method is required for chopping
+up the list of objects, which can be calculated by any node which has
+the 'interesting' commits.
+
+To determine the objects which fall within a given slice, the object
+list must be enumerated and then divided evenly.  As the compressed
+size on one node cannot be reproduced on another node, the
+uncompressed size of the object is used instead, in the hope that
+slices will generally end up roughly even in size once compressed.
+
+To calculate the object boundaries, the list of objects in
+--date-order must first be tie-broken, e.g. for commits with the same
+commit date and topological order.  If slices are allowed to be
+sub-commit level, then objects between commits (i.e. blobs and trees)
+are also sorted.  Once this deterministic list has been built, all of
+the objects must be accessed to determine their length.  The objects
+which start within a given range are the ones in the slice, as
+sketched below.
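+
+A sketch of that assignment, using cumulative uncompressed sizes
+(names are illustrative):
+
+  int slice_of(off_t start_offset, off_t total_size, int nr_slices)
+  {
+          /* an object belongs to the slice whose byte range
+           * contains the object's starting offset */
+          return (int)(start_offset * nr_slices / total_size);
+  }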
+
+For re-using deltas in sliced bundles, the delta base is looked up in
+the deterministic list.  If it has an earlier sequence, then the delta
+can be safely re-used.  If it has a later sequence, then the delta
+must be resolved and then the object re-deltified/compressed, using
+only objects with an earlier sequence (or in the returned pack) as a
+base.
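+
+The rule reduces to a sequence comparison (sketch; the sequence
+numbers come from the deterministic list above):
+
+  int delta_base_ok(uint32_t base_seq, uint32_t obj_seq)
+  {
+          /* only earlier-sequenced objects may serve as bases, so
+           * a set of slices re-assembles without forward
+           * references */
+          return base_seq < obj_seq;
+  }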
+
+This requirement is so that a collection of 'sliced' bundles can
+successfully re-assemble, while still allowing them to be 'thin'.
-- 
debian.1.5.6.1
