Before any code is introduced the full documentation is put forth. This provides a man page for the porcelain, and a technical doc in technical/. The latter describes the API, and discusses rev-cache's design, file format and mechanics. Signed-off-by: Nick Edelen <sirnot@xxxxxxxxx> --- Documentation/rev-cache.txt | 51 +++++ Documentation/technical/rev-cache.txt | 336 +++++++++++++++++++++++++++++++++ 2 files changed, 387 insertions(+), 0 deletions(-) diff --git a/Documentation/rev-cache.txt b/Documentation/rev-cache.txt new file mode 100755 index 0000000..64bd051 --- /dev/null +++ b/Documentation/rev-cache.txt @@ -0,0 +1,51 @@ +rev-cache porcelain +=================== + +A front end for the rev-cache API is provided with the builtin utility +`rev-cache`. It is mainly intended for cache slice generation and maintenance, +but can also walk commits within a slice. At the moment it is not particularly +advanced, but is sufficient for repository administration. + +It's general syntax is: + +`git-rev-cache COMMAND [options] [<commit-id>...]` + +With the commands: + +`add`:: + Add revisions to the cache. Reads commit ids from stdin, formatted as: + `END END ... \--not START START ...` ++ +Options: + +`\--all`:: Include all heads as ends. +`\--fresh`:: Exclude everything already in a cache slice. +`\--stdin`:: Also read commit ids from stdin (seperated by newline, `\--not` +also valid). +`\--legs`:: Ensure branch has no "dangling" starts (ie. is self-contained). +`\--noobjects`:: Don't include non-commit objects. + +`walk`:: + Walk a cache slice based on set of commits; formatted as add. ++ +Options: + +`\--objects`:: Include non-commit objects in traversal. + +`fuse`:: + Coagulate several cache slices into a single large slice. ++ +Options: + +`\--all`:: Include all objects in repository. Will only traverse as of cache +ends if this is not specified. +`\--noobjects`:: Don't include non-commit objects. +`\--ignore-size[=N]`:: Do not fuse slices of file size >= `N`. If `N` is not +given the cutoff size defaults to ~5MB. + +`index`:: + Regenerate the cache index. + + +For an explanation of the API and its inner workings, see +link:technical/rev-cache.txt[technical info on rev-cache] diff --git a/Documentation/technical/rev-cache.txt b/Documentation/technical/rev-cache.txt new file mode 100755 index 0000000..e95ec89 --- /dev/null +++ b/Documentation/technical/rev-cache.txt @@ -0,0 +1,336 @@ +rev-cache +========= + +The revision cache API ('rev-cache') provides a method for efficiently storing +and accessing commit branch sections. Such branch slices are defined by a +series of top (interesting) and bottom (uninteresting) commits. Each slice +contains, per commit: + +* All intra-slice topological relations, encoded into path "channels" (see + 'Mechanics' for full explanation). +* Object meta-data: type, SHA-1, size, date (for commits). +* Objects introduced in that commit, relative to slice (ie. only for non-start + commits). + +Storage data structures are not exported, in part to keep git's global scope +clean, but largely because they're pretty much useless outside of rev-cache. + +The API +------- + +The API for 'rev-cache' is quite simple. You can find the function prototypes +in `revision.h`. + +Data Structures +~~~~~~~~~~~~~~~ + +The `rev_cache_info` struct holds all the options and flags for the API. + +---- +struct rev_cache_info { + /* generation flags */ + unsigned objects : 1, + legs : 1, + make_index : 1; + + /* traversal flags */ + unsigned save_unique : 1, + add_to_pending : 1; + + /* fuse options */ + unsigned int ignore_size; +}; +---- + +The fields: + +`objects`:: Add non-commit objects to slice. +`legs`:: Ensure bottoms have no childrens. +`make_index`:: Integrate newly-made slice into index. +`save_unique`:: Load unique non-commit objects into `unique` field of each +`commit` object. +`add_to_pending`:: Append unique non-commit objects to the `pending` object +list in the passed `rev_info` instance. +`ignore_size`:: If non-zero, ignore slices with size greater or equal to this. + +Functions +~~~~~~~~~ + +`init_rci`:: + + Initiate a `rev_cache_info` struct to default options. + +`make_cache_slice`:: + + Create a cache based on an a `rev_info` instance or `commit_list` s of + "tops" and "bottoms" (defaulting to latter if `rev_info` pointer is + NULL), copying the cache SHA-1 into a passed pointer if non-zero. A + `rev_cache_info` struct pointer can be passed to set options, while + passing NULL will set default options. A last parameter can + optionally recieve the final cache hash. + +`make_cache_index`:: + + Add a slice to the cache index. Requires a file descriptor, the cache + hash and the file size. Note that this is normally called by + `make_cache_slice` under the `make_index` option. + +`get_cache_slice`:: + + Given a commit SHA-1 `get_cache_slice` will search the slice index and + return, if found, the cache-identifying SHA-1. + +`traverse_cache_slice`:: + + Traverse a specified cache slice based on: + + * `rev_cache_info` instance (optional) + * cache SHA-1 identifier + * `rev_info` instance + * a starting commit and commit work list + * date of oldest encountered interesting commit + * current `slop` (this and above mainly used in integration with + revision walker) + ++ +The output is sent to a FILO `commit_list` "queue", while any bottom commits +are passed back into the work list. If the walk is not contained within the +slice, commit boundaries are also inserted into "work". + +`tops_from_slices`:: + + Will mark all top-commits in the specified cache slices with a given + flag, and add them to the rev pending list. Will include all if no + slices are specified. + +`coagulate_cache_slices`:: + + Generate a slice based on the passed `rev_info` instance, replacing all + encountered slices with one (larger) slice. The `ignore_size` field in + `rci`, if non-zero, will dictate which cache slice sizes to ignore in + both traversal and replacement. + +`regenerate_index`:: + + Remake cache index. + +Example Usage +------------- + +A few examples to demonstrate usage: + +.Creating a slice +---- +/* pretend you're a porcelain for rev-cache reading from the command line */ +struct rev_info revs; +struct rev_cache_info rci; + +init_revisions(&revs, 0); +init_rci(&rci); + +flags = 0; +for (i = 1; i < argc; i++) { + if (!strcmp(argv[i], "--not")) + flags ^= UNINTERESTING; + else if(!strcmp(argv[i], "--fresh")) + starts_from_slices(&revs, UNINTERESTING, 0, 0); + else + handle_revision_arg(argv[i], &revs, flags, 1); +} + +/* we want to explicitly set certain options */ +rci.objects = 0; + +if (!make_cache_slice(&rci, &revs, 0, 0, cache_sha1)) + printf("made slice! it's called %s\n", sha1_to_hex(cache_sha1)); +---- + +.Traversing a slice +---- +/* let's say you're walking the tree with a 'work' list of current heads and a + * FILO output list 'out' */ +out = 0; +outp = &out; + +while (work) { + struct commit *commit = pop_commit(&work); + struct object *object = &commit->object; + unsigned char *cache_sha1; + + if (cache_sha1 = get_cache_slice(object->sha1)) { + /* note that this will instatiate any topo-relations + * as it goes */ + if (traverse_cache_slice(&revs, cache_sha1, + commit, 0, 0, /* use defaults */ + &outp, &work) < 0) + die("I'm overreacting to a non-fatal cache error"); + } else { + struct commit_list *parents = commit->parents; + + while (parents) { + struct commit *p = parents->item; + struct object *po = &p->object; + + parents = parents->next; + if (po->flags & UNINTERESTING) + continue; + + if (object->flags & UNINTERESTING) + po->flags |= UNINTERESTING; + else if (po->flags & SEEN) + continue; + + if (!po->parsed) + parse_commit(p); + insert_by_date(p, &work); + } + + if (object->flags & (SEEN | UNINTERESTING) == 0) + outp = &commit_list_insert(commit, outp); + object->flags |= SEEN; + } +} +---- + +Some Internals +-------------- + +Although you really don't need to know anything about how rev-cache actually +does its magic shizz, a bit of background may go a long way if you're wading +through the source. + +File Formats +~~~~~~~~~~~~ + +A slice has a basic fixed-size header, followed by a certain number of object +entries. Commits are sorted in topo-order, and each commit entry is followed +by the objects added in that commit. + +---- + -- +--------------------------------+ +header | object number, etc... | + -- +--------------------------------+ +commit | commit info | +entry | path data | + +--------------------------------+ + | tree/blob info | + +--------------------------------+ + | tree/blob info | + +--------------------------------+ + | ... | + -- +--------------------------------+ +commit | commit info | +entry | path data | + +--------------------------------+ + | tree/blob info | + +--------------------------------+ + | ... | + -- +--------------------------------+ +... ... + +--------------------------------+ +---- + +The index is somewhat similar to pack-file indexes, containing a fanout table +and a list of index entries sorted by hash. + +---- + -- +--------------------------------+ +header | object #, cache #, etc. | + -- +--------------------------------+ +cache | SHA-1 | +sha1s | ... | + -- +--------------------------------+ +fanout | fanout[0x00] | +table ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + | fanout[0xff] | + -- +--------------------------------+ +index | entry SHA-1 | +entries | cache sha1 index | + +--------------------------------+ + | | + ... + +--------------------------------+ +---- + +All the relavant structures are readily accessible in `rev-cache.c` + +Mechanics +~~~~~~~~~ + +The most important part of rev-cache is its method of encoding topological +relations. To ensure fluid traversal and reconstruction, commits are related +through high-level "streams"/"channels" rather than individual +interconnections. Intuitively, rev-cache stores history the way gitk shows it: +commits strung up on lines, which interconnect at merges and branches. + +Each commit is associated to a given channel/path via a 'path id', and +variable-length fields govern which paths (if any) are closed or opened at that +object. This means that topo-data can be preserved in only a few bytes extra +per object entry. Other information stored per entry is the sha-1 hash, type, +date, size, and status in cache slice. Here is format of an object entry, both +on-disk and in-memory: + +---- +struct object_entry { + unsigned type : 3; + unsigned is_end : 1; + unsigned is_start : 1; + unsigned uninteresting : 1; + unsigned include : 1; + unsigned flags : 1; + unsigned char sha1[20]; + + unsigned merge_nr : 6; + unsigned split_nr : 7; + unsigned size_size : 3; + + uint32_t date; + uint16_t path; + + /* merge paths */ + /* split paths */ + /* size */ +}; +---- + +An explanation of each field: + +`type`:: Object type +`is_end`:: The commit has some parents outside the cache slice (all if slice +has legs) +`is_start`:: The commit has no children in cache slice +`uninteresting`:: Run-time flag, used in traversal +`include`:: Run-time flag, used in traversal (initialization) +`flags`:: Currently unused, extra bit +`sha`:: Object SHA-1 hash + +`merge_nr`:: The number of paths the current channel diverges into; the current +path ends upon any merge. +`split_nr`:: The number of paths this commit ends; used on both merging and +branching. +`size_size`:: Number of bytes the object size takes up. + +merge paths:: The path IDs (16-bit) that are to be created. +split paths:: The path IDs (16-bit) that are to be ended. +size:: The size split into the minimum number of bytes. + +Each path ID refers to an index in a 'path array', which stores the current +status (eg. active, interestingness) of each channel. + +Due to topo-relations and boundary tracking, all of a commit's parents must be +encountered before the path is reallocated. This is achieved by using a +counter system per merge: starting at the parent number, the counter is +decremented as each parent is encountered (dictated by 'split paths'); at 0 the +path is cleared. + +Boundary tracking is necessary because non-commits are stored relative to the +commit in which they were introduced. If a series of commits is not included +in the output, the last interesting commit must be parsed manually to ensure +all objects are accounted for. + +To prevent list-objects from recursing into trees that we've already taken care +of, the flag `FACE_VALUE` is introduced. An object with this flag is not +explored (= "taken at face value"), significantly reducing I/O and processing +time. + +(NSE) -- -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html