Re: [PATCH 1/5] revision caching documentation: man page and technical discussion

Sam Vilain <sam@xxxxxxxxxx> · Fri, 07 Aug 2009 15:08:58 +1200

Nick Edelen wrote:
> Before any code is introduced the full documentation is put forth.  This 
> provides a man page for the porcelain, and a technical doc in technical/.  The 
> latter describes the API, and discusses rev-cache's design, file format and 
> mechanics.
>
> Signed-off-by: Nick Edelen <sirnot@xxxxxxxxx>
>
> ---
>  Documentation/rev-cache.txt           |   51 +++++
>  Documentation/technical/rev-cache.txt |  336 +++++++++++++++++++++++++++++++++
>  2 files changed, 387 insertions(+), 0 deletions(-)
>   

A couple of minor nits here:

1. Documentation/rev-cache.txt should be
Documentation/git-rev-cache.txt, because it documents that command.

2. there are many whitespace errors:

wilber:~/src/git$ git am \[PATCH\ 1_5\]\ revision\ caching\
documentation\:\ man\ page\ and\ technical\ discussion.eml
Applying: revision caching documentation: man page and technical discussion
/home/samv/src/git/.git/rebase-apply/patch:15: trailing whitespace.
A front end for the rev-cache API is provided with the builtin utility
/home/samv/src/git/.git/rebase-apply/patch:16: trailing whitespace.
`rev-cache`.  It is mainly intended for cache slice generation and
maintenance,
/home/samv/src/git/.git/rebase-apply/patch:17: trailing whitespace.
but can also walk commits within a slice.  At the moment it is not
particularly
/home/samv/src/git/.git/rebase-apply/patch:34: trailing whitespace.
`\--stdin`:: Also read commit ids from stdin (seperated by newline,
`\--not`
/home/samv/src/git/.git/rebase-apply/patch:51: trailing whitespace.
`\--all`:: Include all objects in repository.  Will only traverse as of
cache
warning: squelched 166 whitespace errors
warning: 171 lines add whitespace errors.

Be sure to check the patch doesn't do that...

> diff --git a/Documentation/rev-cache.txt b/Documentation/rev-cache.txt
> new file mode 100755
> index 0000000..64bd051
> --- /dev/null
> +++ b/Documentation/rev-cache.txt
> @@ -0,0 +1,51 @@
> +rev-cache porcelain
> +===================
> +
> +A front end for the rev-cache API is provided with the builtin utility 
> +`rev-cache`.  It is mainly intended for cache slice generation and maintenance, 
> +but can also walk commits within a slice.  At the moment it is not particularly 
> +advanced, but is sufficient for repository administration.
>   

That last sentence is a bit unnecessarily self-doubting; "It currently
provides basic functionality" would be fine.

> +
> +It's general syntax is:
> +
> +`git-rev-cache COMMAND [options] [<commit-id>...]`
> +
> +With the commands:
>   

You should use the same structure and headings of this file as the other
man pages which are found under Documentation/git-*.txt

> +
> +`add`::
> +	Add revisions to the cache.  Reads commit ids from stdin, formatted as:
> +	`END END ... \--not START START ...`
>   

Really, it's reading a revision list; might be better to call it that. 
Can it accept the same options as git-rev-list here or just a list of
starting points/tips?  And "END" and "START" are still backwards to the
rest of git core!

> ++
> +Options:
> +
> +`\--all`:: Include all heads as ends.
> +`\--fresh`:: Exclude everything already in a cache slice.
>   

If you use the paragraph layout used by other commands you can explain a
little bit more about what you mean here.  By "Cache Slice" do you mean
"Revision Cache"?  Does that --fresh imply that all of the revisions
which were not indexed will automatically get indexed?

> +`\--stdin`:: Also read commit ids from stdin (seperated by newline, `\--not` 
> +also valid).
> +`\--legs`:: Ensure branch has no "dangling" starts (ie. is self-contained).
>   

The --legs term is a little too 'cute', is there a better way to
describe this?  And "starts" is not a real word in that context; if you
are using a technical term, you should define it..

> +`\--noobjects`:: Don't include non-commit objects.
> +
> +`walk`::
> +	Walk a cache slice based on set of commits; formatted as add.
>   

So, this works like 'rev-list' ?

"Formatted" is wrong there I think; do you mean it 'accepts the same
arguments as add' ?

> ++
> +Options:
> +
> +`\--objects`:: Include non-commit objects in traversal.
> +
>   

Which was the default?  --objects or --no-objects?

> +`fuse`::
> +	Coagulate several cache slices into a single large slice.
>   

Coagulate?  You mean, the revision caches will stop being liquid and go
gluggy, like a pool of blood clotting?

How about "combine" :-) - and the option might be better called
something simple like that, too.

> ++
> +Options:
> +
> +`\--all`:: Include all objects in repository.  Will only traverse as of cache 
> +ends if this is not specified.
>   

That sentence doesn't quite read well to me.

> +`\--noobjects`:: Don't include non-commit objects.
> +`\--ignore-size[=N]`:: Do not fuse slices of file size >= `N`.  If `N` is not 
> +given the cutoff size defaults to ~5MB.
> +
> +`index`::
> +	Regenerate the cache index.
> +
> +
> +For an explanation of the API and its inner workings, see 
> +link:technical/rev-cache.txt[technical info on rev-cache]
> diff --git a/Documentation/technical/rev-cache.txt b/Documentation/technical/rev-cache.txt
> new file mode 100755
> index 0000000..e95ec89
> --- /dev/null
> +++ b/Documentation/technical/rev-cache.txt
> @@ -0,0 +1,336 @@
> +rev-cache
> +=========
> +
> +The revision cache API ('rev-cache') provides a method for efficiently storing 
> +and accessing commit branch sections.  Such branch slices are defined by a 
> +series of top (interesting) and bottom (uninteresting) commits.  Each slice 
> +contains, per commit:
> +
> +* All intra-slice topological relations, encoded into path "channels" (see 
> +  'Mechanics' for full explanation).
> +* Object meta-data: type, SHA-1, size, date (for commits).
> +* Objects introduced in that commit, relative to slice (ie. only for non-start 
> +  commits).
> +
> +Storage data structures are not exported, in part to keep git's global scope 
> +clean, but largely because they're pretty much useless outside of rev-cache.
> +
> +The API
> +-------
> +
> +The API for 'rev-cache' is quite simple.  You can find the function prototypes 
> +in `revision.h`.
> +
> +Data Structures
> +~~~~~~~~~~~~~~~
> +
> +The `rev_cache_info` struct holds all the options and flags for the API.
> +
> +----
> +struct rev_cache_info {
> +        /* generation flags */
> +        unsigned objects : 1, 
> +                legs : 1, 
> +                make_index : 1;
> +        
> +        /* traversal flags */
> +        unsigned save_unique : 1, 
> +                add_to_pending : 1;
> +        
> +        /* fuse options */
> +        unsigned int ignore_size;
> +};
> +----
> +
> +The fields:
> +
> +`objects`:: Add non-commit objects to slice.
> +`legs`:: Ensure bottoms have no childrens.
>   

"children" is already plural and needs no 's'

> +`make_index`:: Integrate newly-made slice into index.
> +`save_unique`:: Load unique non-commit objects into `unique` field of each 
> +`commit` object.
>   

Unique how?  Do you mean, that they were introduced in this slice and
not reachable from any of the bottom/end commits?

> +`add_to_pending`:: Append unique non-commit objects to the `pending` object 
> +list in the passed `rev_info` instance.
>   

That doesn't make much sense to me either.  What does that mean?  Well,
I'll read on.

> +`ignore_size`:: If non-zero, ignore slices with size greater or equal to this.
>   

What will this ignoring mean?

> +Functions
> +~~~~~~~~~
> +
> +`init_rci`::
> +
> +        Initiate a `rev_cache_info` struct to default options.  
> +
> +`make_cache_slice`::
> +
> +        Create a cache based on an a `rev_info` instance or `commit_list` s of 
> +        "tops" and "bottoms" (defaulting to latter if `rev_info` pointer is 
> +        NULL), copying the cache SHA-1 into a passed pointer if non-zero.  A 
> +        `rev_cache_info` struct pointer can be passed to set options, while 
> +        passing NULL will set default options.  A last parameter can 
> +        optionally recieve the final cache hash.
> +
> +`make_cache_index`::
> +
> +        Add a slice to the cache index.  Requires a file descriptor, the cache 
> +        hash and the file size.  Note that this is normally called by 
> +        `make_cache_slice` under the `make_index` option.
> +
> +`get_cache_slice`::
> +
> +        Given a commit SHA-1 `get_cache_slice` will search the slice index and 
> +        return, if found, the cache-identifying SHA-1.
> +
> +`traverse_cache_slice`::
> +
> +        Traverse a specified cache slice based on:
> +
> +        * `rev_cache_info` instance (optional)
> +        * cache SHA-1 identifier
> +        * `rev_info` instance
> +        * a starting commit and commit work list
> +        * date of oldest encountered interesting commit
> +        * current `slop` (this and above mainly used in integration with 
> +          revision walker)
>   

Hmm what's a 'slop' ?

> +        
> ++
> +The output is sent to a FILO `commit_list` "queue", while any bottom commits 
> +are passed back into the work list.  If the walk is not contained within the 
> +slice, commit boundaries are also inserted into "work".
> +
> +`tops_from_slices`::
> +
> +        Will mark all top-commits in the specified cache slices with a given 
> +        flag, and add them to the rev pending list.  Will include all if no 
> +        slices are specified.
> +
> +`coagulate_cache_slices`::
> +
> +        Generate a slice based on the passed `rev_info` instance, replacing all 
> +        encountered slices with one (larger) slice.  The `ignore_size` field in 
> +        `rci`, if non-zero, will dictate which cache slice sizes to ignore in 
> +        both traversal and replacement.
> +
> +`regenerate_index`::
> +
> +        Remake cache index.
> +
> +Example Usage
> +-------------
> +
> +A few examples to demonstrate usage:
> +
> +.Creating a slice
> +----
> +/* pretend you're a porcelain for rev-cache reading from the command line */
> +struct rev_info revs;
> +struct rev_cache_info rci;
> +
> +init_revisions(&revs, 0);
> +init_rci(&rci);
> +
> +flags = 0;
> +for (i = 1; i < argc; i++) {
> +        if (!strcmp(argv[i], "--not"))
> +                flags ^= UNINTERESTING;
> +        else if(!strcmp(argv[i], "--fresh"))
> +                starts_from_slices(&revs, UNINTERESTING, 0, 0);
> +        else
> +                handle_revision_arg(argv[i], &revs, flags, 1);
> +}
> +
> +/* we want to explicitly set certain options */
> +rci.objects = 0;
> +
> +if (!make_cache_slice(&rci, &revs, 0, 0, cache_sha1))
> +        printf("made slice!  it's called %s\n", sha1_to_hex(cache_sha1));
>   

Heh, normally it's acceptable to let examples in documentation not be
complete working programs :)

> +----
> +
> +.Traversing a slice
> +----
> +/* let's say you're walking the tree with a 'work' list of current heads and a 
> + * FILO output list 'out' */
> +out = 0;
> +outp = &out;
> +
> +while (work) {
> +        struct commit *commit = pop_commit(&work);
> +        struct object *object = &commit->object;
> +        unsigned char *cache_sha1;
> +        
> +        if (cache_sha1 = get_cache_slice(object->sha1)) {
> +                /* note that this will instatiate any topo-relations 
> +                 * as it goes */
> +                if (traverse_cache_slice(&revs, cache_sha1, 
> +                        commit, 0, 0, /* use defaults */
> +                        &outp, &work) < 0)
> +                        die("I'm overreacting to a non-fatal cache error");
> +        } else {
> +                struct commit_list *parents = commit->parents;
> +                
> +                while (parents) {
> +                        struct commit *p = parents->item;
> +                        struct object *po = &p->object;
> +                        
> +                        parents = parents->next;
> +                        if (po->flags & UNINTERESTING)
> +                                continue;
> +                        
> +                        if (object->flags & UNINTERESTING)
> +                                po->flags |= UNINTERESTING;
> +                        else if (po->flags & SEEN)
> +                                continue;
> +                        
> +                        if (!po->parsed)
> +                                parse_commit(p);
> +                        insert_by_date(p, &work);
> +                }
> +                
> +                if (object->flags & (SEEN | UNINTERESTING) == 0)
> +                        outp = &commit_list_insert(commit, outp);
> +                object->flags |= SEEN;
> +        }
> +}
> +----
>   

Ok, good - perhaps needs a comment or two - not much, it's already long.

> +
> +Some Internals
> +--------------
> +
> +Although you really don't need to know anything about how rev-cache actually 
> +does its magic shizz, a bit of background may go a long way if you're wading 
> +through the source.
>   

"does its work" will do nicely..

> +
> +File Formats
> +~~~~~~~~~~~~
> +
> +A slice has a basic fixed-size header, followed by a certain number of object 
> +entries.  Commits are sorted in topo-order, and each commit entry is followed 
> +by the objects added in that commit.
>   

Starting from the top or the bottom?  And it might pay to clarify which
objects are included or not; ie objects reachable from "bottom" commits
or not.

> +
> +----
> +         -- +--------------------------------+
> +header      | object number, etc...          |
> +         -- +--------------------------------+
> +commit      | commit info                    |
> +entry       | path data                      |
> +            +--------------------------------+
> +            | tree/blob info                 |
> +            +--------------------------------+
> +            | tree/blob info                 |
> +            +--------------------------------+
> +            | ...                            |
> +         -- +--------------------------------+
> +commit      | commit info                    |
> +entry       | path data                      |
> +            +--------------------------------+
> +            | tree/blob info                 |
> +            +--------------------------------+
> +            | ...                            |
> +         -- +--------------------------------+
> +...         ...                               
> +            +--------------------------------+
> +----
> +
> +The index is somewhat similar to pack-file indexes, containing a fanout table 
> +and a list of index entries sorted by hash.
> +
> +----
> +         -- +--------------------------------+
> +header      | object #, cache #, etc.        |
> +         -- +--------------------------------+
> +cache       | SHA-1                          |
> +sha1s       | ...                            |
> +         -- +--------------------------------+
> +fanout      | fanout[0x00]                   |
> +table       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +            | fanout[0xff]                   |
> +         -- +--------------------------------+
> +index       | entry SHA-1                    |
> +entries     | cache sha1 index               |
> +            +--------------------------------+
> +            |                                |
> +            ...                               
> +            +--------------------------------+
> +----
> +
> +All the relavant structures are readily accessible in `rev-cache.c`
> +
>   

Ok right so which is the on-disk container format?  The index or the
slice?  I'm a little confused..

> +Mechanics
> +~~~~~~~~~
> +
> +The most important part of rev-cache is its method of encoding topological 
> +relations.  To ensure fluid traversal and reconstruction,

Is it still fluid after coagulation?  ;-)

>  commits are related 
> +through high-level "streams"/"channels" rather than individual 
> +interconnections.  Intuitively, rev-cache stores history the way gitk shows it: 
> +commits strung up on lines, which interconnect at merges and branches.
> +
> +Each commit is associated to a given channel/path via a 'path id', and 
> +variable-length fields govern which paths (if any) are closed or opened at that 
> +object.  This means that topo-data can be preserved in only a few bytes extra 
> +per object entry.  Other information stored per entry is the sha-1 hash, type, 
> +date, size, and status in cache slice.  Here is format of an object entry, both 
> +on-disk and in-memory:
> +
> +----
> +struct object_entry {
> +        unsigned type : 3;
> +        unsigned is_end : 1;
> +        unsigned is_start : 1;
> +        unsigned uninteresting : 1;
> +        unsigned include : 1;
> +        unsigned flags : 1;
> +        unsigned char sha1[20];
> +        
> +        unsigned merge_nr : 6;
> +        unsigned split_nr : 7;
> +        unsigned size_size : 3;
> +        
> +	uint32_t date;
> +	uint16_t path;
> +        
> +        /* merge paths */
> +        /* split paths */
> +        /* size */
> +};
> +----
> +
> +An explanation of each field:
> +
> +`type`:: Object type
> +`is_end`:: The commit has some parents outside the cache slice (all if slice 
> +has legs)
> +`is_start`:: The commit has no children in cache slice
> +`uninteresting`:: Run-time flag, used in traversal
> +`include`:: Run-time flag, used in traversal (initialization)
> +`flags`:: Currently unused, extra bit
> +`sha`:: Object SHA-1 hash
> +
> +`merge_nr`:: The number of paths the current channel diverges into; the current 
> +path ends upon any merge.
> +`split_nr`:: The number of paths this commit ends; used on both merging and 
> +branching.
> +`size_size`:: Number of bytes the object size takes up.
> +
> +merge paths:: The path IDs (16-bit) that are to be created.
> +split paths:: The path IDs (16-bit) that are to be ended.
> +size:: The size split into the minimum number of bytes.
> +
> +Each path ID refers to an index in a 'path array', which stores the current 
> +status (eg. active, interestingness) of each channel.
> +
> +Due to topo-relations and boundary tracking, all of a commit's parents must be 
> +encountered before the path is reallocated.  This is achieved by using a 
> +counter system per merge: starting at the parent number, the counter is 
> +decremented as each parent is encountered (dictated by 'split paths'); at 0 the 
> +path is cleared.
> +
> +Boundary tracking is necessary because non-commits are stored relative to the 
> +commit in which they were introduced.  If a series of commits is not included 
> +in the output, the last interesting commit must be parsed manually to ensure 
> +all objects are accounted for.
> +
> +To prevent list-objects from recursing into trees that we've already taken care 
> +of, the flag `FACE_VALUE` is introduced.  An object with this flag is not 
> +explored (= "taken at face value"), significantly reducing I/O and processing 
> +time.
> +
> +(NSE)
>   

tl;dr but that's ok it's magic shizz.

Anyway looking nice ... see what I can say about the next patches.

sam
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html